Predicting Google closures

Analyzing predictors of Google abandoning products; predicting future shutdowns
statistics, archiving, predictions, R, survival-analysis, Google
2013-03-28–2019-04-04 finished certainty: likely importance: 7


Prompted by the shutdown of Google Reader, I ponder the evanescence of online services and wonder what is the risk of them disappearing. I collect data on 350 Google products launched before March 2013, looking for variables predictive of mortality (web hits, service vs software, commercial vs free, FLOSS, social networking, and internal vs acquired). Shutdowns are unevenly distributed over the calendar year or Google’s history. I use logistic regression & survival analysis (which can deal with right-censorship) to model the risk of shutdown over time and examine correlates. The logistic regression indicates socialness, acquisitions, and lack of web hits predict being shut down, but the results may not be right. The survival analysis finds a median lifespan of 2824 days with a roughly Type III survival curve (high early-life mortality); a Cox regression finds similar results to the logistic: socialness & being an acquisition predict higher mortality, while profitability, web hits, and an early launch predict lower mortality. Using the best model, I make predictions about the probability of shutdown of the most risky and least risky services in the next 5 years (up to March 2018). (All data & R source code is provided.)

Google has occasionally shut down services I use, and not always with serious warning (many tech companies are like that - here one day and gone the next - though Google is one of the least-worst); this is frustrating and tedious.

Naturally, we are preached at by apologists that Google owes us nothing and if it’s a problem then it’s all our fault and we should have prophesied the future better (and too bad about the ordinary people who may be screwed over or the unique history1 or data casually destroyed).

But how can we have any sort of rational expectation if we lack any data or ideas about how long Google will run anything or why or how it chooses to do what it does? So in the following essay, I try to get an idea of the risk, and hopefully the results are interesting, useful, or both.

A glance back

“This is something that literature has always been very keen on, that technology never gets around to acknowledging. The cold wind moaning through the empty stone box. When are you gonna own up to it? Where are the Dell PCs? This is Austin, Texas. Michael Dell is the biggest tech mogul in central Texas. Why is he not here? Why is he not at least not selling his wares? Where are the dedicated gaming consoles you used to love? Do you remember how important those were? I could spend all day here just reciting the names of the casualties in your line of work. It’s always the electronic frontier. Nobody ever goes back to look at the electronic forests that were cut down with chainsaws and tossed into the rivers. And then there’s this empty pretense that these innovations make the world ‘better’…Like: ‘If we’re not making the world better, then why are we doing this at all?’ Now, I don’t want to claim that this attitude is hypocritical. Because when you say a thing like that at South By: ‘Oh, we’re here to make the world better’—you haven’t even reached the level of hypocrisy. You’re stuck at the level of childish naivete.”

Bruce Sterling, “Text of SXSW2013 closing remarks”

The shutdown of the popular service Google Reader, announced on 2013-03-13, has brought home to many people that some products they rely on exist only at Google’s sufferance: it provides the products for reasons that are difficult for outsiders to divine, may have little commitment to a product234, may not include their users’ best interests, may choose to withdraw the product at any time for any reason5 (especially since most of the products are services6 & not in any way, and may be too tightly coupled with the Google infrastructure7 to be spun off or sold, so when the CEO turns against it & no Googlers are willing to waste their careers championing it…), and users have no voice8 - as an option.

Andy Baio (“Never Trust a Corporation to do a Library’s Job”) summarizes Google’s track record:

“Google’s mission is to organize the world’s information and make it universally accessible and useful.”

For years, Google’s mission included the preservation of the past. In 2001, Google made their first acquisition, the Deja archives. The largest collection of Usenet archives, Google relaunched it as Google Groups, supplemented with archived messages going back to 1981. In 2004, Google Books signaled the company’s intention to scan every known book, partnering with libraries and developing its own book scanner capable of digitizing 1,000 pages per hour. In 2006, Google News Archive launched, with historical news articles dating back 200 years. In 2008, they expanded it to include their own digitization efforts, scanning newspapers that were never online. In the last five years, starting around 2010, the shifting priorities of Google’s management left these archival projects in limbo, or abandoned entirely. After a series of redesigns, Google Groups is effectively dead for research purposes. The archives, while still online, have no means of searching by date. Google News Archives are dead, killed off in 2011, now directing searchers to just use Google. Google Books is still online, but curtailed their scanning efforts in recent years, likely discouraged by a decade of legal wrangling still in appeal. The official blog stopped updating in 2012 and the Twitter account’s been dormant since February 2013. Even Google Search, their flagship product, stopped focusing on the history of the web. In 2011, Google removed the Timeline view letting users filter search results by date, while a series of major changes to their search ranking algorithm increasingly favored freshness over older pages from established sources. (To the detriment of some.)…As it turns out, organizing the world’s information isn’t always profitable. Projects that preserve the past for the public good aren’t really a big profit center. Old Google knew that, but didn’t seem to care.

In the case of Reader, while Reader destroyed the original RSS reader market, there still exist some usable alternatives; the consequence is a shrinkage in the RSS audience as inevitably many users choose not to invest in a new reader or give up or interpret it as a deathblow to RSS, and an irreversible loss of Reader’s uniquely comprehensive RSS archives back to 2005. Although to be fair, I should mention 2 major points in favor of Google:

  1. a reason I did and still do use Google services is that, with a few lapses like Website Optimizer aside, they are almost unique in enabling users to back up their data via the work of the Data Liberation Front and have been far more proactive than many companies in encouraging users to back up data from dead services - for example, in automatically copying Buzz users’ data to their Google Drive.
  2. Google’s practice of undercutting all market incumbents with free services also has very large benefits9, so we shouldn’t focus just on the seen.

But nevertheless, every shutdown still hurts its users to some degree, even if we - currently10 - can rule out the most devastating possible shutdowns, like Gmail. It would be interesting to see if shutdowns are to some degree predictable, whether there are any patterns, whether common claims about relevant factors can be confirmed, and what the results might suggest for the future.

Data

Sources

Dead products

“The summer grasses—
the sole remnants of many
brave warriors’ dreams.”

Basho

I begin with a list of services/APIs/programs that Google has shut down or abandoned, taken from the Guardian article “Google Keep? It’ll probably be with us until March 2017 - on average: The closure of Google Reader has got early adopters and developers worried that Google services or APIs they adopt will just get shut off. An analysis of 39 shuttered offerings says how long they get” by Charles Arthur. Arthur’s list seemed relatively complete, but I’ve added in >300 items he missed based on the Slate graveyard, Weber’s “Google Fails 36% Of The Time”11, the Wikipedia category/list for Google acquisitions, the Wikipedia category/list, and finally the official Google History. (The additional shutdowns include many shutdowns predating 2010, suggesting that Arthur’s list was biased towards recent shutdowns.)

In a few cases, the start dates are well-informed guesses (eg. Google Translate) and dates of abandonment/shut-down are even harder to get due to the lack of attention paid to most (Joga Bonito), and so I infer the date from archived pages on the Internet Archive, news reports, blogs such as Google Operating System, the dates of press releases, the shutdown of closely related services (eReader Play based on Reader), source code repositories (AngularJS), etc; some are listed as discontinued (Google Catalogs) but are still supported or were merged into other software (Spreadsheets, Docs, Writely, News Archive) or sold/given to third parties (Flu Shot Finder, App Inventor, Body) or active effort has ceased but the content remains, and so I do not list those as dead; for cases of acquired software/services that were shut down, I date the start from Google’s purchase.

Live products

“…He often lying broad awake, and yet / Remaining from the body, and apart / In intellect and power and will, hath heard / Time flowing in the middle of the night, / And all things creeping to a day of doom.”

Tennyson, “The Mystic”, Poems, Chiefly Lyrical

A major criticism of Arthur’s post was that it was fundamentally using the wrong data: if you have a dataset of all Google products which have been shut down, you can make statements like “the average dead Google product lived 1459 days”, but you can’t infer very much about a live product’s life expectancy - because you don’t know if it will join the dead products. If, for example, only 1% of products ever died, then 1459 days would lead to a massive underestimate of the average lifespan of all currently living products. With his data, you can only make inferences conditional on a product eventually dying; you cannot make an unconditional inference. Unfortunately, the unconditional question “will it die?” is the real question any Google user wants answered!

So drawing on the same sources, I have compiled a second list of living products; the ratio of living to dead gives a base rate for how likely a randomly selected Google product is to be canceled within the 1997-2013 window, and with the date of the founding of each living product, we can also do a simple right-censored survival analysis which will let us make still better predictions by extracting concrete results like mean time to shutdown. Some items are dead in the most meaningful sense since they have been closed to new users (Sync), lost major functionality (FeedBurner, Meebo), degraded severely due to neglect (eg. ), or just been completely neglected for a decade or more (Google Groups’ Usenet archive) - but haven’t actually died or closed yet, so I list them as alive.

Variables

“To my good friend Would I show, I thought, The plum blossoms, Now lost to sight Amid the falling snow.”

, VIII: 1426

Simply collecting the data is useful since it allows us to make some estimates like overall death-rates or median lifespan. But maybe we can do better than just base rates and find characteristics which let us crack open the Google black box a tiny bit. So finally, for all products, I have collected several covariates which I thought might help predict longevity:

  • Hits: the number of Google hits for a service

    While number of Google hits is a very crude measure, at best, for underlying variables like “popularity” or “number of users” or “profitability”, and clearly biased towards recently released products (there aren’t going to be as many hits for, say, “Google Answers” as there would have been if we had searched for it in 2002), it may add some insight.

    There do not seem to be any other free quality sources indicating either historical or contemporary traffic to a product URL/homepage which could be used in the analysis - services like Alexa or Google Ad Planner either are commercial, for domains only, or simply do not cover many of the URLs. (After I finished data collection, it was pointed out to me that while Google’s Ad Planner may not be useful, Google’s AdWords does yield a count of global searches for a particular query that month, which would have worked, albeit it would only indicate current levels of interest and nothing about historical levels.)

  • Type: a categorization into “service”/“program”/“thing”/“other”

    1. A service is anything primarily accessed through a web browser or API or the Internet; so Gmail or a browser loading fonts from a Google server, but not a Gmail notification program one runs on one’s computer or a FLOSS font available for download & distribution.

    2. A program is anything which is an application, plugin, library, framework, or all of these combined; some are very small (Authenticator) and some are very large (Android). This does include programs which require Internet connections or Google APIs as well as programs for which the source code has not been released, so things in the program category are not immune to shutdown and may be useful only as long as Google supports them.

    3. A thing is anything which is primarily a physical object. A cellphone running Android or a Chromebook would be an example.

      In retrospect, I probably should have excluded this category entirely: there’s no reason to expect cellphones to follow the same lifecycle as a service or program, it leads to even worse classification problems (when does an Android cellphone ‘die’? should one even be looking at individual cellphones or laptops rather than entire product lines?), there tend to be many iterations of a product and they’re all hard to research, etc.

    4. Other is the catch-all category for things which don’t quite seem to fit. Where does a Google think-tank, charity, conference, or venture capital fund fit in? They certainly aren’t software, but they don’t seem to be quite services either.

  • Profit: whether Google directly makes money off a product

    This is a tricky one. Google excuses many of its products by saying that anything which increases Internet usage benefits Google, and so by this logic, every single one of its services could potentially increase profit; but this is a little stretched, the truth very hard to judge by an outsider, and one would expect that products without direct monetization are more likely to be killed.

    Generally, I classify as for-profit any Google product directly relating to producing/displaying advertising, paid subscriptions, fees, or purchases (AdWords, Gmail, Blogger, Search, shopping engines, surveys); but many do not seem to have any form of monetization related to them (Alerts, Office, Drive, Gears, Reader12). Some services like Voice charge (for international calls) but the amounts are minor enough that one might wonder if classifying them as for-profit is really right. While it might make sense to define every feature added to, say, Google Search (eg. Personalized Search, or Search History) as being ‘for profit’ since Search lucratively displays ads, I have chosen to classify these secondary features as being not for profit.

  • FLOSS: whether the source code was released or Google otherwise made it possible for third parties to continue the service or maintain the application.

    In the long run, the utility of all non-Free software approaches zero. All non-Free software is a dead end.13

    Android, AngularJS, and Chrome are all examples of software products where Google losing interest would not be fatal; services spun off to third parties would also count. Many of the codebases rely on a proprietary Google API or service (especially the mobile applications), which means that this variable is not as meaningful and laudable as one might expect, so in the minority of cases where this variable is relevant, I code Dead & Ended as related to whether & when Google abandoned it, regardless of whether it was then picked up by third parties or not. (Example: App Inventor for Android is listed as dying in December 2011, though it was then half a year later handed over to MIT, who has supported it since.) It’s important to not naively believe that simply because source code is available, Google support doesn’t matter.

  • Acquisition: whether it was related to a purchase of a company or licensing, or internally developed.

    This is useful for investigating the so-called “Google black hole”: Google has bought many startups (DoubleClick, Dodgeball, Android, Picasa), or technologies/data licensed (SYSTRAN for Translate, Twitter data for Real-Time Search), but it’s claimed many stagnate & wither (Jaiku, JotSpot, Dodgeball, Zagat). So we’ll include this. If a closely related product is developed and released after purchase, like a mobile application, I do not class it as an acquisition; just products that were in existence when the company was purchased. I do not include products that Google dropped immediately on purchase (Apture, fflick, Sparrow, Reqwireless, PeakStream, Wavii) or where products based on them have not been released (BumpTop).

Hits

Ideally we would have Google hits from the day before a product was officially killed, but the past is, alas, no longer accessible to us, and we only have hits from searches I conducted 2013-04-01–2013-04-05. There are three main problems with the Google hits metric:

  1. the Web keeps growing, so 1 million hits in 2000 are not equivalent to 1 million hits in 2013
  2. services which are not killed live longer and can rack up more hits
  3. and the longer ago a product’s hits came into existence, the more likely the relevant hits may be to have disappeared themselves.

We can partially compensate by looking at hits averaged by lifespan; 100k hits means much less for something that lived for a decade than 100k hits means for something that lived just 6 months. What about the growth objection? We can estimate the size of Google’s index at any period and interpret the current hits as a fraction of the index when the service died (example: suppose Answers has 1 million hits, died in 2006, and in 2006 the index held 1 billion URLs; then we’d turn our 1m hit figure into 1/1000 or 0.001); this gives us our “deflated hits”. We’ll deflate the hits by first estimating the size of the index, fitting an exponential to the rare public reports and third-party estimates of the size of the Google index. The data points with the best linear fit:

Estimating Google WWW index size over time

It fits reasonably well. (A sigmoid might fit better, but maybe not, given the large disagreements towards the end.) With this we can then average over days as well, giving us 4 hit metrics to use. We’ll look closer at the hit variables later.
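As a concrete sketch of the deflation step (mirroring the appendix code; the `deflate` helper function is mine, while the index CSV and `model1` names are the appendix’s):

index <- read.csv("https://www.gwern.net/docs/statistics/2013-google-index.csv",
                  colClasses=c("Date","double","character"))
# exponential growth: log(index size) is roughly linear in the date
model1 <- lm(log(Size) ~ Date, data=index)

# deflated hits: current hit count as a fraction of the estimated index size
# at the time the product was launched
deflate <- function(hits, started)
    hits / exp(predict(model1, newdata=data.frame(Date=started)))

deflate(1e6, as.Date("2006-01-01")) # eg. 1m hits for a product launched in 2006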

Processing

If a product has not ended, the end-date is defined as 2013-04-01 (which is when I stopped compiling products); then the total lifetime is simply the end-date minus the start-date. The final CSV is available at 2013-google.csv. (I welcome corrections from Googlers or Xooglers about any variables like launch or shutdown dates or products directly raising revenue.)
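A minimal sketch of that processing, following the (commented-out) derivation in the appendix; the distributed CSV already contains the derived columns, so re-deriving them is redundant but shows the definitions:

google <- read.csv("https://www.gwern.net/docs/statistics/2013-google.csv")
google$Started <- as.Date(google$Started)
google$Ended   <- as.Date(google$Ended)
# censored (still-live) products get the compilation cutoff as their end-date
google$Ended[!google$Dead] <- as.Date("2013-04-01")
# total lifetime: end-date minus start-date, in days
google$Days <- as.integer(google$Ended - google$Started)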

Analysis

“I spur my horse past ruins / Ruins move a traveler’s heart / the old parapets high and low / the ancient graves great and small / the shuddering shadow of a tumbleweed / the steady sound of giant trees. / But what I lament are the common bones / unnamed in the records of Immortals.”

14

Descriptive

Loading up our hard-won data and looking at an R summary (for full source code reproducing all graphs and analyses below, see the appendix; I welcome statistical corrections or elaborations if accompanied by equally reproducible R source code), we can see we have a lot of data to look at:

    Dead            Started               Ended                 Hits               Type
#  Mode :logical   Min.   :1997-09-15   Min.   :2005-03-16   Min.   :2.04e+03   other  : 14
#  FALSE:227       1st Qu.:2006-06-09   1st Qu.:2012-04-27   1st Qu.:1.55e+05   program: 92
#  TRUE :123       Median :2008-10-18   Median :2013-04-01   Median :6.50e+05   service:234
#                  Mean   :2008-05-27   Mean   :2012-07-16   Mean   :5.23e+07   thing  : 10
#                  3rd Qu.:2010-05-28   3rd Qu.:2013-04-01   3rd Qu.:4.16e+06
#                  Max.   :2013-03-20   Max.   :2013-11-01   Max.   :3.86e+09
#    Profit          FLOSS         Acquisition       Social             Days         AvgHits
#  Mode :logical   Mode :logical   Mode :logical   Mode :logical   Min.   :   1   Min.   :      1
#  FALSE:227       FALSE:300       FALSE:287       FALSE:305       1st Qu.: 746   1st Qu.:    104
#  TRUE :123       TRUE :50        TRUE :63        TRUE :45        Median :1340   Median :    466
#                                                                  Mean   :1511   Mean   :  29870
#                                                                  3rd Qu.:2112   3rd Qu.:   2980
#                                                                  Max.   :5677   Max.   :3611940
#   DeflatedHits    AvgDeflatedHits  EarlyGoogle      RelativeRisk    LinearPredictor
#  Min.   :0.0000   Min.   :-36.57   Mode :logical   Min.   : 0.021   Min.   :-3.848
#  1st Qu.:0.0000   1st Qu.: -0.84   FALSE:317       1st Qu.: 0.597   1st Qu.:-0.517
#  Median :0.0000   Median : -0.54   TRUE :33        Median : 1.262   Median : 0.233
#  Mean   :0.0073   Mean   : -0.95                   Mean   : 1.578   Mean   : 0.000
#  3rd Qu.:0.0001   3rd Qu.: -0.37                   3rd Qu.: 2.100   3rd Qu.: 0.742
#  Max.   :0.7669   Max.   :  0.00                   Max.   :12.556   Max.   : 2.530
#  ExpectedEvents   FiveYearSurvival
#  Min.   :0.0008   Min.   :0.0002
#  1st Qu.:0.1280   1st Qu.:0.1699
#  Median :0.2408   Median :0.3417
#  Mean   :0.3518   Mean   :0.3952
#  3rd Qu.:0.4580   3rd Qu.:0.5839
#  Max.   :2.0456   Max.   :1.3443

Shutdowns over time

Google Reader: “Who is it in the blogs that calls on me? / I hear a tongue shriller than all the YouTubes / Cry ‘Reader!’ Speak, Reader is turn’d to hear.”

Dataset: “Beware the ideas of March.”

Julius Caesar, Act 1, scene 2, 15-19; with apologies.

An interesting aspect of the shutdowns is that they are unevenly distributed by month, as we can see with a chi-squared test (p = 0.014) and graphically, with a major spike in September and then March/April15:

Shutdowns binned by month of year, revealing peaks in September, March, and April

As befits a company which has grown enormously since 1997, we can see other imbalances over time: eg. Google launched very few products from 1997-2004, and many more from 2005 on:

Starts binned by year

We can plot lifetime against shutdown date to get a clearer picture:

All products scatter-plotted, date of shutdown vs lifespan

That clumpiness around 2009 is suspicious. To emphasize this bulge of shutdowns in late 2011-2012, we can plot the histogram of dead products by year and also a kernel density:

Shutdown density binned by year
Equivalent kernel density (default bandwidth)

The kernel density brings out an aspect of shutdowns we might have missed before: there seems to be an absence of recent shutdowns. There are 4 shutdowns scheduled for 2013 but the last one is scheduled for November, suggesting that we have seen the last of the 2013 casualties and that any future shutdowns may be for 2014.

What explains such graphs over time? One candidate is the 2011-04-04 accession of Larry Page to CEO, replacing Eric Schmidt, who had been hired to provide “adult supervision” for pre-IPO Google. Page respected Steve Jobs greatly (he and Brin suggested, before meeting Schmidt, that their CEO be Jobs). Isaacson’s Steve Jobs records that before his death, Jobs had strongly advised Page to “focus”, and asked “What are the five products you want to focus on?”, saying “Get rid of the rest, because they’re dragging you down.” And on 2011-07-14 Page posted:

…Greater focus has also been another big feature for me this quarter – more wood behind fewer arrows. Last month, for example, we announced that we will be closing Google Health and Google PowerMeter. We’ve also done substantial internal work simplifying and streamlining our product lines. While much of that work has not yet become visible externally, I am very happy with our progress here. Focus and prioritization are crucial given our amazing opportunities.

While some have tried to disagree, it’s hard not to conclude that indeed, a wall of shutdowns followed in late 2011 and 2012. But this sounds very much like a one-time purge: if one has a new focus on focus, then one may not be starting up as many services as before, and the services which one does start up should be more likely to survive.

Modeling

Logistic regression

A first step in predicting when a product will be shut down is predicting whether it will be shut down. Since we’re predicting a binary outcome (a product living or dying), we can use the usual tool: an ordinary logistic regression. Our first look uses the main variables plus the total hits:
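The fit producing the coefficient table below is an ordinary binomial GLM (as in the appendix):

summary(glm(Dead ~ Type + Profit + FLOSS + Acquisition + Social + log(Hits),
            data=google, family="binomial"))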

# Coefficients:
#                 Estimate Std. Error z value Pr(>|z|)
# (Intercept)       2.3968     1.0680    2.24    0.025
# Typeprogram       0.9248     0.8181    1.13    0.258
# Typeservice       1.2261     0.7894    1.55    0.120
# Typething         0.8805     1.1617    0.76    0.448
# ProfitTRUE       -0.3857     0.2952   -1.31    0.191
# FLOSSTRUE        -0.1777     0.3791   -0.47    0.639
# AcquisitionTRUE   0.4955     0.3434    1.44    0.149
# SocialTRUE        0.7866     0.3888    2.02    0.043
# log(Hits)        -0.3089     0.0567   -5.45  5.1e-08

In a logistic regression, a coefficient >0 increases the chance of an event (shutdown) and <0 decreases it. So looking at the coefficients, we can venture some interpretations (a short sketch converting these log-odds into odds ratios follows this list):

  • Google has a past history of screwing up social products and then killing them

    This is interesting for confirming the general belief that Google has handled its social properties badly in the past, but I’m not sure how useful this is for predicting the future: since Larry Page became obsessed with social in 2009, we might expect anything to do with “social” would be either merged into Google+ or otherwise be kept on life support longer than it would before.

  • Google is deprecating software products in favor of web services

    A lot of Google’s efforts with Firefox and then Chromium were for improving web browsers as a platform for delivering applications. As efforts like HTML5 mature, there is less incentive for Google to release and support standalone software.

  • But apparently not its FLOSS software

    This seems due to a number of its software releases being picked up by third-parties (Wave, Etherpad, Refine), designed to be integrated into existing communities (Summer of Code projects), or apparently serving a strategic role (Android, Chromium, Dart, Go, Closure Tools, VP codecs) which we could summarize as ‘building up a browser replacement for operating systems’. (Why? )

  • Things which charge or show advertising are more likely to survive

    We expect this, but it’s good to have confirmation (if nothing else, it partially validates the data).

  • Popularity as measured by Google hits seems to matter

    …Or does it? This variable seems particularly treacherous and susceptible to reverse-causation issues (does lack of hits diagnose failure, or does failing cause lack of hits when I later searched?)
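To make the coefficient sizes more intuitive, exponentiate them to get odds ratios (values taken from the regression output above):

# a logistic coefficient b multiplies the odds of shutdown by exp(b):
exp(0.7866)  # SocialTRUE: ~2.2x the odds of being shut down
exp(0.4955)  # AcquisitionTRUE: ~1.6x the odds
exp(-0.3089) # log(Hits): each e-fold (~2.7x) increase in hits cuts the odds to ~0.73x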

Use of hits data

Is our popularity metric - or any of the 4 - trustworthy? All this data has been collected after the fact, sometimes by many years; what if the data have been contaminated by the fact that something shut down? For example, by a burst of publicity about an obscure service shutting down? (Ironically, this page is contributing to the inflation of hits for any dead service mentioned.) Are we just seeing information “leakage”? Leakage can be subtle, as I learned for myself doing this analysis.

Investigating further, hits by themselves do matter:

#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   3.4052     0.7302    4.66  3.1e-06
# log(Hits)    -0.3000     0.0549   -5.46  4.7e-08

Average hits (hits divided by the product’s lifetime) turns out to be even more important:

#              Estimate Std. Error z value Pr(>|z|)
# (Intercept)    -2.297      1.586   -1.45    0.147
# log(Hits)       0.511      0.209    2.44    0.015
# log(AvgHits)   -0.852      0.217   -3.93  8.3e-05

This is more than a little strange; that the higher the average hits, the less likely a product is to be killed makes perfect sense, but then surely the higher the total hits, the less likely as well? Apparently not. The mystery deepens as we bring in the third hit metric we developed:

#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)        -21.589     11.955   -1.81   0.0709
# log(Hits)            2.054      0.980    2.10   0.0362
# log(AvgHits)        -1.921      0.708   -2.71   0.0067
# log(DeflatedHits)   -0.456      0.277   -1.64   0.1001

And sure enough, if we run all 4 hit variables, 3 of them turn out to be statistically-significant and large:

#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)       -24.6898    12.4696   -1.98   0.0477
# log(Hits)           2.2908     1.0203    2.25   0.0248
# log(AvgHits)       -2.0943     0.7405   -2.83   0.0047
# log(DeflatedHits)  -0.5383     0.2914   -1.85   0.0647
# AvgDeflatedHits    -0.0651     0.0605   -1.08   0.2819

It’s not that the hit variables are somehow summarizing or proxying for the others, because if we toss in all the non-hits predictors and penalize parameters based on adding complexity without increasing fit, we still wind up with the 3 hit variables:

#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)        -23.341     12.034   -1.94   0.0524
# AcquisitionTRUE      0.631      0.350    1.80   0.0712
# SocialTRUE           0.907      0.394    2.30   0.0213
# log(Hits)            2.204      0.985    2.24   0.0252
# log(AvgHits)        -2.068      0.713   -2.90   0.0037
# log(DeflatedHits)   -0.492      0.280   -1.75   0.0793
# ...
# AIC: 396.9

Most of the predictors were removed as not helping a lot, 3 of the 4 hit variables survived (but not the hits which were both averaged & deflated, suggesting that variable wasn’t adding much in combination), and we see two of the better predictors from earlier survived: whether something was an acquisition and whether it was social.

The original hits variable has the wrong sign, as expected of data leakage; now the average and deflated hits have the predicted sign (the higher the hit count, the lower the risk of death), but this doesn’t put to rest my concerns: the average hits has the right sign, yes, but now the effect size seems way too high - we reject the raw hits with a log-odds of +2.1 as contaminated and a correlation almost 4 times larger than one of the known-good correlations (being an acquisition), yet the average hits is -2 & almost as big a log-odds! The only variable which seems trustworthy is the deflated hits: it has the right sign and is a more plausible ~5x smaller. I’ll use just the deflated hits variable (although I will keep in mind that I’m still not sure it is free from data leakage).

Survival curve

The logistic regression helped winnow down the variables but is limited to the binary outcome of shutdown or not; it can’t use the potentially very important variable of how many days a product has survived, for the simple reason that of course mortality will increase with time! (“But this long run is a misleading guide to current affairs. In the long run we are all dead.”)

For looking at survival over time, survival analysis might be a useful elaboration. Not being previously familiar with the area, I drew on Wikipedia, Fox & Weisberg’s appendix, Zhou’s tutorial, and Hosmer & Lemeshow’s Applied Survival Analysis for the following results using the survival library (see also the CRAN Task View: Survival Analysis, and the taxonomy of survival analysis methods). Any errors are mine.

The initial characterization gives us an optimistic median of 2824 days (note that this is higher than Arthur’s mean of 1459 days because it addressed the conditionality issue discussed earlier by including products which were never canceled, and I made a stronger effort to collect pre-2009 products), but the lower bound is not tight and too little of the sample has died to get an upper bound:
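The Kaplan-Meier fit behind this summary and the curve below is simply (as in the appendix):

library(survival)
surv <- survfit(Surv(google$Days, google$Dead, type="right") ~ 1)
surv # prints the number of records, events, and the median lifespan with its CI
plot(surv, xlab="Days", ylab="Survival probability (with 95% CI)")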

# records   n.max n.start  events  median 0.95LCL 0.95UCL
#     350     350     350     123    2824    2095      NA

Our overall survivorship curve looks a bit interesting:

Shutdown cumulative probability as a function of time

If there were constant mortality of products at each day after their launch, we would expect a “type II” curve where it looks like a straight line, and if the hazard increased with age like with humans we would see a “type I” graph in which the curve nose-dives; but in fact it looks like there’s a sort of “leveling off” of deaths, suggesting a “type III” curve; per Wikipedia:

…the greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring (see ).

Very nifty: the survivorship curve is consistent with tech industry or startup philosophies of doing lots of things, iterating fast, and throwing things at the wall to see what sticks. (More pleasingly, it suggests that my dataset is not biased against the inclusion of short-lived products: if I had been failing to find a lot of short-lived products, then we would expect to see the true survivorship curve distorted into something of a type II or type I curve and not a type III curve where a lot of products are early deaths; so if there were a data collection bias against short-lived products, then the true survivorship curve must be even more extremely type III.)

However, it looks like the mortality only starts decreasing around 2000 days, so any product that far out must have been founded around or before 2005, which is when we previously noted that Google started pumping out a lot of products and may also have changed its shutdown-related behaviors; this could violate a basic assumption of Kaplan-Meier, that the underlying survival function isn’t itself changing over time.

Our next step is to fit a Cox proportional hazards model to our covariates:
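The Cox fit itself (as in the appendix; there the deflated-hits column is assumed to already be on the log scale, matching the `log(DeflatedHits)` term in the output below):

library(survival)
cmodel <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                   DeflatedHits,
                data=google)
summary(cmodel)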

# ...n= 350, number of events= 123
#
#                     coef exp(coef) se(coef)     z Pr(>|z|)
# AcquisitionTRUE    0.130     1.139    0.257  0.51    0.613
# FLOSSTRUE          0.141     1.151    0.293  0.48    0.630
# ProfitTRUE        -0.180     0.836    0.231 -0.78    0.438
# SocialTRUE         0.664     1.943    0.262  2.53    0.011
# Typeprogram        0.957     2.603    0.747  1.28    0.200
# Typeservice        1.291     3.638    0.725  1.78    0.075
# Typething          1.682     5.378    1.023  1.64    0.100
# log(DeflatedHits) -0.288     0.749    0.036 -8.01  1.2e-15
#
#                   exp(coef) exp(-coef) lower .95 upper .95
# AcquisitionTRUE       1.139      0.878     0.688     1.884
# FLOSSTRUE             1.151      0.868     0.648     2.045
# ProfitTRUE            0.836      1.197     0.531     1.315
# SocialTRUE            1.943      0.515     1.163     3.247
# Typeprogram           2.603      0.384     0.602    11.247
# Typeservice           3.637      0.275     0.878    15.064
# Typething             5.377      0.186     0.724    39.955
# log(DeflatedHits)     0.749      1.334     0.698     0.804
#
# Concordance= 0.726  (se = 0.028 )
# Rsquare= 0.227   (max possible= 0.974 )
# Likelihood ratio test= 90.1  on 8 df,   p=4.44e-16
# Wald test            = 79.5  on 8 df,   p=6.22e-14
# Score (logrank) test = 83.5  on 8 df,   p=9.77e-15

And then we can also test whether any of the covariates are suspicious; in general they seem to be fine:
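The check is the standard test of the proportional-hazards assumption on the fitted model (as in the appendix):

cox.zph(cmodel) # per-covariate & global tests; a low p-value would flag a violation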

#                       rho  chisq     p
# AcquisitionTRUE   -0.0252 0.0805 0.777
# FLOSSTRUE          0.0168 0.0370 0.848
# ProfitTRUE        -0.0694 0.6290 0.428
# SocialTRUE         0.0279 0.0882 0.767
# Typeprogram        0.0857 0.9429 0.332
# Typeservice        0.0936 1.1433 0.285
# Typething          0.0613 0.4697 0.493
# log(DeflatedHits) -0.0450 0.2610 0.609
# GLOBAL                 NA 2.5358 0.960

My suspicion lingers, though, so I threw in another covariate (EarlyGoogle): whether a product was released before or after 2005. Does this add predictive value above and over simply knowing that a product is really old, and does the regression still pass the proportional assumption check? Apparently yes to both:
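The extra covariate is just an indicator for a pre-2005 launch, added and re-fit along the lines of the appendix (`cmodel2` is my name for the refit model):

# flag products launched before 2005, then re-fit the Cox model with the new covariate
google$EarlyGoogle <- (as.POSIXlt(google$Started)$year + 1900) < 2005
cmodel2 <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                    DeflatedHits + EarlyGoogle,
                 data=google)
summary(cmodel2)
cox.zph(cmodel2)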

#                      coef exp(coef) se(coef)     z Pr(>|z|)
# AcquisitionTRUE    0.1674    1.1823   0.2553  0.66    0.512
# FLOSSTRUE          0.1034    1.1090   0.2922  0.35    0.723
# ProfitTRUE        -0.1949    0.8230   0.2318 -0.84    0.401
# SocialTRUE         0.6541    1.9233   0.2601  2.51    0.012
# Typeprogram        0.8195    2.2694   0.7472  1.10    0.273
# Typeservice        1.1619    3.1960   0.7262  1.60    0.110
# Typething          1.6200    5.0529   1.0234  1.58    0.113
# log(DeflatedHits) -0.2645    0.7676   0.0375 -7.06  1.7e-12
# EarlyGoogleTRUE   -1.0061    0.3656   0.5279 -1.91    0.057
# ...
# Concordance= 0.728  (se = 0.028 )
# Rsquare= 0.237   (max possible= 0.974 )
# Likelihood ratio test= 94.7  on 9 df,   p=2.22e-16
# Wald test            = 76.7  on 9 df,   p=7.2e-13
# Score (logrank) test = 83.8  on 9 df,   p=2.85e-14
#                        rho   chisq     p
# ...
# EarlyGoogleTRUE   -0.05167 0.51424 0.473
# GLOBAL                  NA 2.52587 0.980

As predicted, the pre-2005 variable does indeed correlate with a lower chance of being shut down, is the third-largest predictor, and almost reaches a random16 level of statistical-significance - but it doesn’t trigger the assumption tester, so we’ll keep using the Cox model.

Now let’s interpret the model. The covariates tell us that to reduce the risk of shutdown, you want to:

  1. Not be an acquisition
  2. Not be FLOSS
  3. Be directly making money
  4. Not be related to social networking
  5. Have lots of Google hits relative to lifetime
  6. Have been launched early in Google’s lifetime

This all makes sense to me. I find particularly interesting the profit and social effects, but the odds are a little hard to understand intuitively; if being social increases the odds of shutdown by 1.9233 and not being directly profitable increases the odds by 1.215, what do those look like? We can graph pairs of survivorship curves, splitting the full dataset (omitting the confidence intervals for legibility, although they do overlap), to get a grasp of what these numbers mean:
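These paired curves are ordinary Kaplan-Meier fits stratified by the covariate in question (as in the appendix):

# split by Profit
smodel1 <- survfit(Surv(Days, Dead) ~ Profit, data=google)
plot(smodel1, lty=c(1,2), xlab="Days", ylab="Fraction surviving by Day")
legend("bottomleft", legend=c("Profit = no", "Profit = yes"), lty=c(1,2), inset=0.02)

# split by Social
smodel2 <- survfit(Surv(Days, Dead) ~ Social, data=google)
plot(smodel2, lty=c(1,2), xlab="Days", ylab="Fraction surviving by Day")
legend("bottomleft", legend=c("Social = no", "Social = yes"), lty=c(1,2), inset=0.02)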

All products over time, split by Profit variable
All products over time, split by Social variable

Random forests

Because I can, I was curious how random forests (Breiman 2001) might stack up against the logistic regression and against a base-rate predictor (that nothing was shut down, since ~65% of the products are still alive).

With randomForest, I trained a random forest as a classifier, yielding reasonable-looking error rates:
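The classifier is trained on the same covariates as the survival models (as in the appendix; Days is deliberately left out, for the leakage reason discussed below):

library(randomForest)
rf <- randomForest(as.factor(Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                     DeflatedHits + EarlyGoogle,
                   importance=TRUE, data=google)
rf             # OOB error estimate & confusion matrix
importance(rf) # which variables the trees found most useful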

#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of  error rate: 31.71%
# Confusion matrix:
#       FALSE TRUE class.error
# FALSE   216   11     0.04846
# TRUE    100   23     0.81301

To compare the random forest accuracy with the logistic model’s accuracy, I interpreted the logistic estimate of shutdown odds >1 as predicting shutdown and <1 as predicting not shutdown; I then compared the full sets of predictions with the actual shutdown status. (This is not a proper scoring rule like those I employed in grading forecasts of the , but this should be an intuitively understandable way of grading models’ predictions.)
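A sketch of that comparison, with `lmodel` & `rf` being the logistic regression and random forest defined in the appendix:

# base rate: predict 'alive' for every product
sum(!google$Dead) / nrow(google)
# logistic regression: predicted odds > 1 counts as predicting shutdown
sum((exp(predict(lmodel)) > 1) == google$Dead) / nrow(google)
# random forest: predicted class compared against actual status
sum(as.logical(predict(rf)) == google$Dead) / nrow(google)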

The base-rate predictor got 65% right by definition, the logistic managed to score 68% correct17 (95% CI: 66-72%), and the random forest similarly got 68% (67-78%). These rates are not quite as bad as they may seem: I excluded the lifetime length (Days) from the logistic and random forests because unless one is handling it specially with survival analysis, it leaks information; so there’s predictive power being left on the table. A fairer comparison would use lifetimes.

Random survival forests

The next step is to take into account lifetime length & estimated survival curves. We can do that using random survival forests (see also “Mogensen et al 2012”), implemented in randomForestSRC (successor to Ishwaran’s original library randomSurvivalForest). This initially seems very promising:
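The forest is fit with `rfsrc` on the same covariates as the Cox model (as in the appendix):

library(randomForestSRC)
rsf <- rfsrc(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                DeflatedHits + EarlyGoogle,
             data=google, nsplit=1)
rsf       # sample size, trees, and error-rate estimate (below)
plot(rsf) # error vs. forest size, and variable importance (below)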

#                          Sample size: 350
#                     Number of deaths: 122
#                      Number of trees: 1000
#           Minimum terminal node size: 3
#        Average no. of terminal nodes: 61.05
# No. of variables tried at each split: 3
#               Total no. of variables: 7
#                             Analysis: Random Forests [S]RC
#                               Family: surv
#                       Splitting rule: logrank *random*
#        Number of random split points: 1
#               Estimate of error rate: 35.37%

and even gives us a cute plot of how accuracy varies with how big the forest is (looks like we don’t need to tweak it) and how important each variable is as a predictor:

Visual comparison of the average usefulness of each variable to decision trees

Estimating the prediction accuracy for this random survival forest like we did previously, we’re happy to see a 78% rate of correct predictions. Building a predictor based on the Cox model, we get a lesser (but still better than the non-survival models) 72%.

How do these models perform when we check their robustness via the bootstrap? Not so great. The random survival forest’s accuracy collapses to 57-64% (95% CI on 200 replicates), but the Cox model only to 68-73%. This suggests to me that something is going wrong with the random survival forest model (overfitting? programming error?) and there’s no real reason to switch to the more complex random forests, so here too we’ll stick with the ordinary Cox model.
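The robustness check follows the appendix: each bootstrap replicate refits the model on a resample and then scores its predictions against the full original dataset (sketched here only for the logistic regression; the random-survival-forest version needs the extra survival-probability machinery given in the appendix, and the appendix uses far more replicates than the 200 shown here):

library(boot)
logisticPredictionAccuracy <- function(gb, indices) {
  g <- gb[indices,] # bootstrap resample
  m <- glm(Dead ~ Acquisition + FLOSS + Profit + Social + Type +
                  DeflatedHits + EarlyGoogle + Days,
           data=g, family="binomial")
  # score the refit model's predictions on the original, full dataset
  sum((exp(predict(m, newdata=google)) > 1) == google$Dead) / nrow(google)
}
lbs <- boot(data=google, statistic=logisticPredictionAccuracy, R=200)
boot.ci(lbs, type="norm") # normal-approximation 95% CI of prediction accuracy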

Predictions

Before making explicit predictions of the future, let’s look at the predicted relative risks for products which haven’t been shut down. What does the Cox model consider the 10 most at-risk and likely-to-be-shut-down products?
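One way to produce such a ranking (a sketch, not necessarily the exact code used: `predict(..., type="risk")` gives each product’s hazard relative to an average product, matching the RelativeRisk column in the data summary; the `Product` column name for the product names is my assumption):

# relative risk (hazard ratio vs. an average product) under the Cox model
google$RelativeRisk <- predict(cmodel2, type="risk")
alive <- google[!google$Dead,]
# the 10 live products the model considers most at risk...
head(alive[order(alive$RelativeRisk, decreasing=TRUE), c("Product","RelativeRisk")], 10)
# ...and the 10 least at risk
head(alive[order(alive$RelativeRisk), c("Product","RelativeRisk")], 10)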

It lists (in decreasingly risky order):

  1. Schemer
  2. Boutiques
  3. Magnifier
  4. Hotpot
  5. Page Speed Online API
  6. WhatsonWhen
  7. Unofficial Guides
  8. WDYL search engine
  9. Cloud Messaging
  10. Correlate

These all seem like reasonable products to single out (as much as I love Correlate for making it easier than ever to demonstrate “correlation ≠ causation”, I’m surprised it or Boutiques still exist), except for Cloud Messaging, which seems to be a key part of a lot of Android. And likewise, the list of the 10 least risky (in increasingly risky order):

  1. Search
  2. Translate
  3. AdWords
  4. Picasa
  5. Groups
  6. Image Search
  7. News
  8. Books
  9. Toolbar
  10. AdSense

One can’t imagine flagship products like Search or Books ever being shut down, so this list is good as far as it goes; I am skeptical about the actual unriskiness of Picasa and Toolbar given their general neglect and old-fashionedness, though I understand why the model favors them (both are pre-2005, proprietary, many hits, and advertising-supported). But let’s get more specific; looking at still-alive services, what predictions do we make about the odds of a selected batch surviving the next, say, 5 years? We can derive a survival curve for each member of the batch adjusted for each subject’s covariates (and they visibly differ from each other):

Estimated curves for 15 interesting products (AdSense, Scholar, Voice, etc)

But these are the curves for hypothetical populations all like the specific product in question, starting from Day 0. Can we extract specific estimates assuming the product has survived to today (as by definition these live services have done)? Yes, but it turns out to require a pretty gruesome hack to extract predictions from the survival curves; anyway, I derive the following 5-year estimates and, as commentary, register my own best guesses as well (I’m at making predictions):
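Roughly, the hack: take the product’s covariate-adjusted survival curve from the Cox model, and condition on its having survived to the present by dividing the curve’s value 5 years from now by its value today, i.e. S(now + 5 years) / S(now). A sketch under those assumptions (not the exact code used; the `Product` column name is again my assumption):

library(survival)
conditionalFiveYear <- function(model, product) {
    sf <- survfit(model, newdata=product) # survival curve adjusted for this product's covariates
    now    <- as.integer(as.Date("2013-04-01") - product$Started) # age today, in days
    future <- now + 5*365
    sNow    <- summary(sf, times=now,    extend=TRUE)$surv
    sFuture <- summary(sf, times=future, extend=TRUE)$surv
    sFuture / sNow # P(survives 5 more years | alive today)
}
# eg.: conditionalFiveYear(cmodel2, google[google$Product == "Gmail",])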

| Product | 5-year survival | Personal guess | Relative risk vs average (lower=better) | Survived (March 2018) |
|---|---|---|---|---|
| AdSense | 100% | 99% | 0.07 | Yes |
| Blogger | 100% | 80% | 0.32 | Yes |
| Gmail | 96% | 99% | 0.08 | Yes |
| Search | 96% | 100% | 0.05 | Yes |
| Translate | 92% | 95% | 0.78 | Yes |
| Scholar | 92% | 85% | 0.10 | Yes |
| Alerts | 89% | 70% | 0.21 | Yes |
| Google+ | 79% | 85% | 0.36 | Yes18 |
| Analytics | 76% | 97% | 0.24 | Yes |
| Chrome | 70% | 95% | 0.24 | Yes |
| Calendar | 66% | 95% | 0.36 | Yes |
| Docs | 63% | 95% | 0.39 | Yes |
| Voice19 | 44% | 50% | 0.78 | Yes |
| FeedBurner | 43% | 35% | 0.66 | Yes |
| Project Glass | 37% | 50% | 0.10 | No |

One immediately spots that some of the model’s estimates seem questionable in the light of our greater knowledge of Google.

I am more pessimistic about the . And I think it’s absurd to give any serious credence to Analytics or Calendar or Docs being at risk (Analytics is a key part of the advertising infrastructure, and Calendar a sine qua non of any business software suite - much less the core of said suite, Docs!). The Glass estimate is also interesting: I don’t know if I agree with the model, given how famous Glass is and how much Google is pushing it - could its future really be so chancy? On the other hand, many tech fads have come and gone without a trace, hardware is always tricky, the more intimate a gadget the more design matters (Glass seems like the sort of thing Apple could make a blockbuster, but can Google?), and Glass has already received a hefty helping of criticism; in particular, the man most experienced with such HUDs has criticized Glass as being “much less ambitious” than the state of the art and worries that “Google and certain other companies are neglecting some important lessons. Their design decisions could make it hard for many folks to use these systems. Worse, poorly configured products might even damage some people’s eyesight and set the movement back years. My concern comes from direct experience.”

But some estimates are more forgivable - Google does have a bad track record with social media so some level of skepticism about Google+ seems warranted (and indeed, in October 2018 Google quietly announced public Google+ would be shut down & henceforth only an enterprise product) - and on FeedBurner or Voice, I agree with the model that their future is cloudy. The extreme optimism about Blogger is interesting since before I began this project, I thought it was slowly dying and would inevitably shut down in a few years; but as I researched the timelines for various Google products, I noticed that Blogger seems to be favored in some ways: such as getting exclusive access to a few otherwise-shutdown things (eg. Scribe & Friend Connect); it was the ground zero for Google’s Dynamic Views skin redesign which was applied globally; and Google is still heavily using Blogger for all its official announcements even into the Google+ era.

Overall, these are pretty sane-sounding estimates.

Followups

“Show me the person who doesn’t die— / death remains impartial. / I recall a towering man / who is now a pile of dust. / The World Below knows no dawn / though plants enjoy another spring; / those visiting this sorrowful place / the pine wind slays with grief.”

Han-Shan, #50

It seems like it might be worthwhile to continue compiling a database and do a followup analysis in 5 years (2018), by which point we can judge how my predictions stacked up against the model, and also because ~100 products may have been shut down (going by the >30 casualties of 2011 and 2012) and the survival curve & covariate estimates rendered that much sharper. So to compile updates, I’ve:

  • set up 2 Google Alerts searches:

    • google ("shut down" OR "shutting" OR "closing" OR "killing" OR "abandoning" OR "leaving")
    • google (launch OR release OR announce)
  • and subscribed to the aforementioned Google Operating System blog

These sources yielded ~64 candidates over the following year before I shut down additions on 2014-06-04.

See Also

Appendix

Source code

Run as R --slave --file=google.r:

set.seed(7777) # for reproducible numbers

library(survival)
library(randomForest)
library(boot)
library(randomForestSRC)
library(prodlim) # for 'sindex' call
library(rms)

# Generate Google corpus model for use in main analysis
# Load the data, fit, and plot:
index <- read.csv("https://www.gwern.net/docs/statistics/2013-google-index.csv",
                   colClasses=c("Date","double","character"))
# an exponential doesn't fit too badly:
model1 <- lm(log(Size) ~ Date, data=index); summary(model1)
# plot logged size data and the fit:
png(file="~/wiki/images/google/www-index-model.png", width = 3*480, height = 1*480)
plot(log(index$Size) ~ index$Date, ylab="WWW index size", xlab="Date")
abline(model1)
invisible(dev.off())

# Begin actual data analysis
google <- read.csv("https://www.gwern.net/docs/statistics/2013-google.csv",
                    colClasses=c("character","logical","Date","Date","double","factor",
                                 "logical","logical","logical","logical", "integer",
                                 "numeric", "numeric", "numeric", "logical", "numeric",
                                 "numeric", "numeric", "numeric"))
# google$Days <- as.integer(google$Ended - google$Started)
# derive all the Google index-variables
## hits per day to the present
# google$AvgHits <- google$Hits / as.integer(as.Date("2013-04-01") - google$Started)
## divide total hits for each product by total estimated size of Google index when that product started
# google$DeflatedHits <- log(google$Hits / exp(predict(model1, newdata=data.frame(Date = google$Started))))
## Finally, let's combine the two strategies: deflate and then average.
# google$AvgDeflatedHits <- log(google$AvgHits) / google$DeflatedHits
# google$DeflatedHits <- log(google$DeflatedHits)

cat("\nOverview of data:\n")
print(summary(google[-1]))

dead <- google[google$Dead,]

png(file="~/wiki/images/google/openedvslifespan.png", width = 1.5*480, height = 1*480)
plot(dead$Days ~ dead$Ended, xlab="Shutdown", ylab="Total lifespan")
invisible(dev.off())

png(file="~/wiki/images/google/shutdownsbyyear.png", width = 1.5*480, height = 1*480)
hist(dead$Ended, breaks=seq.Date(as.Date("2005-01-01"), as.Date("2014-01-01"), "years"),
                   main="shutdowns per year", xlab="Year")
invisible(dev.off())
png(file="~/wiki/images/google/shutdownsbyyear-kernel.png", width = 1*480, height = 1*480)
plot(density(as.numeric(dead$Ended)), main="Shutdown kernel density over time")
invisible(dev.off())

png(file="~/wiki/images/google/startsbyyear.png", width = 1.5*480, height = 1*480)
hist(google$Started, breaks=seq.Date(as.Date("1997-01-01"), as.Date("2014-01-01"), "years"),
     xlab="total products released in year")
invisible(dev.off())

# extract the month of each kill
m = months(dead$Ended)
# sort by chronological order, not alphabetical
m_fac = factor(m, levels = month.name)
# count by month
months <- table(sort(m_fac))
# shutdowns by month are imbalanced:
print(chisq.test(months))
# and visibly so:
png(file="~/wiki/images/google/shutdownsbymonth.png", width = 1.5*480, height = 1*480)
plot(months)
invisible(dev.off())

cat("\nFirst logistic regression:\n")
print(summary(glm(Dead ~ Type + Profit + FLOSS + Acquisition + Social + log(Hits),
                  data=google, family="binomial")))

cat("\nSecond logistic regression, focusing on treacherous hit data:\n")
print(summary(glm(Dead ~ log(Hits), data=google,family="binomial")))
cat("\nTotal + average:\n")
print(summary(glm(Dead ~ log(Hits) + log(AvgHits), data=google,family="binomial")))
cat("\nTotal, average, deflated:\n")
print(summary(glm(Dead ~ log(Hits) + log(AvgHits) + DeflatedHits, data=google,family="binomial")))
cat("\nAll:\n")
print(summary(glm(Dead ~ log(Hits) + log(AvgHits) + DeflatedHits + AvgDeflatedHits,
                  data=google, family="binomial")))
cat("\nStepwise regression through possible logistic regressions involving the hit variables:\n")
print(summary(step(glm(Dead ~ Type + Profit + FLOSS + Acquisition + Social +
                        log(Hits) + log(AvgHits) + DeflatedHits + AvgDeflatedHits,
                       data=google, family="binomial"))))

cat("\nEntering survival analysis section:\n")

cat("\nUnconditional Kaplan-Meier survival curve:\n")
surv <- survfit(Surv(google$Days, google$Dead, type="right") ~ 1)
png(file="~/wiki/images/google/overall-survivorship-curve.png", width = 1.5*480, height = 1*480)
plot(surv, xlab="Days", ylab="Survival Probability function with 95% CI")
invisible(dev.off())

cat("\nCox:\n")
cmodel <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type + DeflatedHits,
                data = google)
print(summary(cmodel))
cat("\nTest proportional assumption:\n")
print(cox.zph(cmodel))

cat("\nPrimitive check for regime change (re-regress & check):\n")
google$EarlyGoogle <- (as.POSIXlt(google$Started)$year+1900) < 2005
cmodel <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                   DeflatedHits + EarlyGoogle,
                data = google)
print(summary(cmodel))
print(cox.zph(cmodel))

cat("\nGenerating intuitive plots of social & profit;\n")
cat("\nPlot empirical survival split by profit...\n")
png(file="~/wiki/images/google/profit-survivorship-curve.png", width = 1.5*480, height = 1*480)
smodel1 <- survfit(Surv(Days, Dead) ~ Profit, data = google);
plot(smodel1, lty=c(1, 2), xlab="Days", ylab="Fraction surviving by Day");
legend("bottomleft", legend=c("Profit = no", "Profit = yes"), lty=c(1 ,2), inset=0.02)
invisible(dev.off())

cat("\nSplit by social...\n")
smodel2 <- survfit(Surv(Days, Dead) ~ Social, data = google)
png(file="~/wiki/images/google/social-survivorship-curve.png", width = 1.5*480, height = 1*480)
plot(smodel2, lty=c(1, 2), xlab="Days", ylab="Fraction surviving by Day")
legend("bottomleft", legend=c("Social = no", "Social = yes"), lty=c(1 ,2), inset=0.02)
invisible(dev.off())

cat("\nTrain some random forests for prediction:\n")
lmodel <- glm(Dead ~ Acquisition + FLOSS + Profit + Social + Type +
                     DeflatedHits + EarlyGoogle + Days,
                   data=google, family="binomial")
rf <- randomForest(as.factor(Dead) ~ Acquisition + FLOSS + Profit + Social +
                                     Type + DeflatedHits + EarlyGoogle,
                   importance=TRUE, data=google)
print(rf)
cat("\nVariables by importance for forests:\n")
print(importance(rf))

cat("\nBase-rate predictor of ~65% products alive:\n")
print(sum(FALSE == google$Dead) / nrow(google))
cat("\nLogistic regression's correct predictions:\n")
print(sum((exp(predict(lmodel))>1) == google$Dead) / nrow(google))
cat("\nRandom forest's correct predictions:\n")
print(sum((as.logical(predict(rf))) == google$Dead) / nrow(google))
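## (note, not in the original script:) the 'exp(predict(lmodel)) > 1' used above for the logistic
## regression thresholds the predicted odds at 1, i.e. the predicted probability at 0.5;
## an equivalent but more transparent form:
print(sum((predict(lmodel, type="response") > 0.5) == google$Dead) / nrow(google))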

cat("\nBegin bootstrap test of predictive accuracy...\n")
cat("\nGet a subsample, train logistic regression on it, test accuracy on original Google data:\n")
logisticPredictionAccuracy <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  lmodel <- glm(Dead ~ Acquisition + FLOSS + Profit + Social + Type +
                       DeflatedHits + EarlyGoogle + Days,
                   data=g, family="binomial")
  return(sum((exp(predict(lmodel, newdata=google))>1) == google$Dead) / nrow(google))
}
lbs <- boot(data=google, statistic=logisticPredictionAccuracy, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(lbs, type="norm"))

cat("\nDitto for random forests:\n")
randomforestPredictionAccuracy <- function(gb, indices) {
  g <- gb[indices,]
  rf <- randomForest(as.factor(Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                       DeflatedHits + EarlyGoogle + Days,
                   data=g)
  return(sum((as.logical(predict(rf))) == google$Dead) / nrow(google))
}
rfbs <- boot(data=google, statistic=randomforestPredictionAccuracy, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(rfbs, type="norm"))

cat("\nFancier comparison: random survival forests and full Cox model with bootstrap\n")
rsf <- rfsrc(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type + DeflatedHits + EarlyGoogle,
             data=google, nsplit=1)
print(rsf)

png(file="~/wiki/images/google/rsf-importance.png", width = 1.5*480, height = 1*480)
plot(rsf)
invisible(dev.off())

# calculate survival probabilities from the forest's cumulative hazard function;
# adapted from Mogensen et al 2012 (I don't understand this)
predictSurvProb.rsf <- function (object, newdata, times, ...) {
    N <- NROW(newdata)
    # class(object) <- c("rsf", "grow")
    # survival probability = exp(-cumulative hazard)
    S <- exp(-predict.rfsrc(object, test = newdata)$chf)
    if(N == 1) S <- matrix(S, nrow = 1)
    Time <- object$time.interest
    # look up each subject's survival probability at the requested time-points
    p <- cbind(1, S)[, 1 + sindex(Time, times), drop = FALSE]
    if(NROW(p) != NROW(newdata) || NCOL(p) != length(times))
     stop("Prediction failed")
    p
}
# each product's maximum possible lifespan in days, as of the analysis cutoff of 2013-04-01
totals <- as.integer(as.Date("2013-04-01") - google$Started)
randomSurvivalPredictionAccuracy <- function(gb, indices) {
    g <- gb[indices,]
    rsfB <- rfsrc(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                     DeflatedHits + EarlyGoogle,
                 data=g, nsplit=1)

    predictionMatrix <- predictSurvProb.rsf(rsfB, google, totals)
    # initialize the prediction vector, then take each product's predicted survival
    # probability at its own maximum possible lifespan (the diagonal of the matrix)
    predictions <- numeric(nrow(google))
    for (i in 1:nrow(google)) { predictions[i] <- predictionMatrix[i,i] }

    return(sum((predictions<0.50) == google$Dead) / nrow(google))
}
# accuracy on full Google dataset
print(randomSurvivalPredictionAccuracy(google, 1:nrow(google)))
# check this high accuracy using bootstrap
rsfBs <- boot(data=google, statistic=randomSurvivalPredictionAccuracy, R=200, parallel="multicore", ncpus=4)
print(rsfBs)
print(boot.ci(rsfBs, type="perc"))

coxProbability <- function(cm, d, t) {
    x <- survfit(cm, newdata=d)
    # survival probability at the first time-point beyond t
    p <- x$surv[Position(function(a) a>t, x$time)]
    # if the fitted curve doesn't reach that far, back off a day and retry; treat NA as 0
    if (is.null(p) || length(p)==0) { p <- coxProbability(cm, d, t-1) } else { if (is.na(p)) p <- 0 }
    p
    }
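## (example, not in the original script:) estimated probability that the first product in the
## dataset survives past its maximum possible lifespan as of the cutoff:
# coxProbability(cmodel, google[1,], totals[1])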
randomCoxPredictionAccuracy <- function(gb, indices) {
    g <- gb[indices,]
    cmodel <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type + DeflatedHits,
                data = g)

    # initialize, then compute each product's predicted probability of surviving past its
    # maximum possible lifespan under the Cox model fitted to this bootstrap sample
    predictions <- numeric(nrow(google))
    for (i in 1:nrow(google)) { predictions[i] <- coxProbability(cmodel, google[i,], totals[i]) }

    return(sum((predictions<0.50) == google$Dead) / nrow(google))
    }
print(randomCoxPredictionAccuracy(google, 1:nrow(google)))
coxBs <- boot(data=google, statistic=randomCoxPredictionAccuracy, R=200, parallel="multicore", ncpus=4)
print(coxBs)
print(boot.ci(coxBs, type="perc"))

cat("\nRanking products by Cox risk ratio...\n")
google$RiskRatio <- predict(cmodel, type="risk")
alive <- google[!google$Dead,]

cat("\nExtract the 10 living products with highest estimated relative risks:\n")
print(head(alive[order(alive$RiskRatio, decreasing=TRUE),], n=10)$Product)

cat("\nExtract the 10 living products with lowest estimated relative risk:\n")
print(head(alive[order(alive$RiskRatio, decreasing=FALSE),], n=10)$Product)

cat("\nBegin calculating specific numerical predictions about remaining lifespans..\n")
cpmodel <- cph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                  DeflatedHits + EarlyGoogle,
               data = google, x=TRUE, y=TRUE, surv=TRUE)
predictees <- subset(google, Product %in% c("Alerts","Blogger","FeedBurner","Scholar",
                                            "Book Search","Voice","Gmail","Analytics",
                                            "AdSense","Calendar","Alerts","Google+","Docs",
                                            "Search", "Project Glass", "Chrome", "Translate"))
# seriously ugly hack
conditionalProbability <- function (d, followupUnits) {
    chances <- rep(NA, nrow(d)) # stash results

    for (i in 1:nrow(d)) {

        # extract chance of particular subject surviving as long as it has:
        beginProb <- survest(cpmodel, d[i,], times=(d[i,]$Days))$surv
        if (length(beginProb)==0) { beginProb <- 1 } # set to a default

        tmpFollowup <- followupUnits # reset in each for loop
        while (TRUE) {
            # extract chance of subject surviving as long as it has + an arbitrary additional time-units
            endProb <- survest(cpmodel, d[i,], times=(d[i,]$Days + tmpFollowup))$surv
            # the fitted survival curve may not reach that far! 'survest' returns 'numeric(0)' if it doesn't;
            # so we shrink down 1 day and try again until 'survest' *does* return a usable answer
            if (length(endProb)==0) { tmpFollowup <- tmpFollowup - 1} else { break }
        }

        # if 50% of all subjects survive to time t, and 20% of all survive to time t+100, say, what chance
        # does a survivor - at exactly time t - have of making it to time t+100? 40%: 0.20 / 0.50 = 0.40
        chances[i] <- endProb / beginProb
    }
    return(chances)
}
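## (example, not in the original script:) estimated chance that Gmail, given that it has
## survived to its current age, survives another year:
# conditionalProbability(subset(google, Product=="Gmail"), 365.25)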
## the risks and survival estimate have been stashed in the original CSV to save computation
# google$RelativeRisk <- predict(cmodel, newdata=google, type="risk")
# google$LinearPredictor <- predict(cmodel, newdata=google, type="lp")
# google$ExpectedEvents <- predict(cmodel, newdata=google, type="expected")
# google$FiveYearSurvival <- conditionalProbability(google, 5*365.25)

# graphs survival curves for each of the 15
png(file="~/wiki/images/google/15-predicted-survivorship-curves.png", width = 1.5*480, height = 1*480)
plot(survfit(cmodel, newdata=predictees),
     xlab = "time", ylab="Survival", main="Survival curves for 15 selected Google products")
invisible(dev.off())
cat("\nPredictions for the 15 and also their relative risks:\n")
ps <- conditionalProbability(predictees, 5*365.25)
print(data.frame(predictees$Product, ps*100))
print(round(predict(cmodel, newdata=predictees, type="risk"), digits=2))

# Analysis done

cat("\nOptimizing the generated graphs by cropping whitespace & losslessly compressing them...\n")
system(paste('cd ~/wiki/images/google/ &&',
             'for f in *.png; do convert "$f" -crop',
             '`nice convert "$f" -virtual-pixel edge -blur 0x5 -fuzz 10% -trim -format',
             '\'%wx%h%O\' info:` +repage "$f"; done'))
system("optipng -o9 -fix ~/wiki/images/google/*.png", ignore.stdout = TRUE)

Leakage

While the hit-counts are a possible form of leakage, I accidentally caused a clear case of leakage while seeing how random forests would do at predicting shutdowns.

One way to get data leakage is to include the end-date: early on in my analysis I removed the Dead variable from the predictors, but it did not occur to me to also remove the Ended date column. The resulting random forest predicted every single shutdown correctly except for 8, an error rate of 2%. How did it turn in this nearly-omniscient set of predictions, and why did it get those 8 wrong? Because those 8 products are correctly marked as "dead" in the original dataset - their shutdowns had already been announced by Google - but were scheduled to die after the day I ran the code. The random forest was simply emitting 'dead' for anything with an end date before 2013-04-04, and 'alive' for everything thereafter!

library(randomForest)
# 'google[-1]' drops only the Product name column; the 'Ended' date remains among the
# predictors picked up by '.', which is the source of the leakage
rf <- randomForest(as.factor(Dead) ~ ., data=google[-1])
# the only misclassified products are the ones already announced as dead but not yet shut down:
google[rf$predicted != google$Dead,]
#                      Product Dead    Started      Ended     Hits    Type Profit FLOSS Acquisition
# 24 Gmail Exchange ActiveSync TRUE 2009-02-09 2013-07-01   637000 service  FALSE FALSE       FALSE
# 30  CalDAV support for Gmail TRUE 2008-07-28 2013-09-16   245000 service  FALSE FALSE       FALSE
# 37                    Reader TRUE 2005-10-07 2013-07-01 79100000 service  FALSE FALSE       FALSE
# 38               Reader Play TRUE 2010-03-10 2013-07-01    43500 service  FALSE FALSE       FALSE
# 39                   iGoogle TRUE 2005-05-01 2013-11-01 33600000 service   TRUE FALSE       FALSE
# 74            Building Maker TRUE 2009-10-13 2013-06-01  1730000 service  FALSE FALSE       FALSE
# 75             Cloud Connect TRUE 2011-02-24 2013-04-30   530000 program  FALSE FALSE       FALSE
# 77   Search API for Shopping TRUE 2011-02-11 2013-09-16   217000 service   TRUE FALSE       FALSE
#    Social Days  AvgHits DeflatedHits AvgDeflatedHits
# 24  FALSE 1603   419.35    9.308e-06         -0.5213
# 30  FALSE 1876   142.86    4.823e-06         -0.4053
# 37   TRUE 2824 28868.61    7.396e-03         -2.0931
# 38  FALSE 1209    38.67    3.492e-07         -0.2458
# 39   TRUE 3106 11590.20    4.001e-03         -1.6949
# 74  FALSE 1327  1358.99    1.739e-05         -0.6583
# 75  FALSE  796   684.75    2.496e-06         -0.5061
# 77  FALSE  948   275.73    1.042e-06         -0.4080
# rf
# ...
#     Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 3
#
#         OOB estimate of  error rate: 2.29%
# Confusion matrix:
#       FALSE TRUE class.error
# FALSE   226    0     0.00000
# TRUE      8  115     0.06504
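As a sanity check (a sketch of my own, not run as part of the original analysis), excluding the Ended column from the predictors removes the leak; refit this way, the forest can no longer key on announced end-dates, and its error rate should fall back to something far less flattering:

rf2 <- randomForest(as.factor(Dead) ~ . - Ended, data=google[-1])
rf2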

  1. One sober­ing ex­am­ple I men­tion in my : were gone within a year. I do not know what the full di­men­sion of the Reader RSS archive loss will be.↩︎

  2. Google Reader affords ex­am­ples of this lack of trans­parency on a key is­sue - Google’s will­ing­ness to sup­port Reader (ex­tremely rel­e­vant to users, and even more so to the third-party web ser­vices and ap­pli­ca­tions which re­lied on Reader to func­tion); from Buz­zFeed’s “Google’s Lost So­cial Net­work: How Google ac­ci­den­tally built a truly beloved so­cial net­work, only to steam­roll it with Google+. The sad, sur­pris­ing story of Google Reader”:

    The diffi­culty was that Reader users, while hy­per­-en­gaged with the pro­duct, never snow­balled into the tens or hun­dreds of mil­lions. Brian Shih be­came the prod­uct man­ager for Reader in the fall of 2008. “If Reader were its own star­tup, it’s the kind of com­pany that Google would have bought. Be­cause we were at Google, when you stack it up against some of these prod­ucts, it’s tiny and is­n’t worth the in­vest­ment”, he said. At one point, Shih re­mem­bers, en­gi­neers were pulled off Reader to work on OpenSo­cial, a “half-baked” de­vel­op­ment plat­form that never amounted to much. “There was al­ways a po­lit­i­cal fight in­ter­nally on keep­ing peo­ple staffed on this lit­tle project”, he re­called. Some­one hung a sign in the Reader offices that said “DAYS SINCE LAST THREAT OF CANCELLATION.” The num­ber was al­most al­ways ze­ro. At the same time, user growth - while small next to Gmail’s hun­dreds of mil­lions - more than dou­bled un­der Shi­h’s tenure. But the “se­nior types”, as Bilotta re­mem­bers, “would look at ab­solute user num­bers. They would­n’t look at mar­ket sat­u­ra­tion. So Reader was con­stantly on the chop­ping block.”

    So when news spread in­ter­nally of Read­er’s geld­ing, it was like Hem­ing­way’s line about go­ing broke: “Two ways. Grad­u­al­ly, then sud­den­ly.” Shih found out in the spring that Read­er’s in­ter­nal shar­ing func­tions - the asym­met­ri­cal fol­low­ing mod­el, en­demic com­ment­ing and lik­ing, and its ad­vanced pri­vacy set­tings - would be su­per­seded by the forth­com­ing Google+ mod­el. Of course, he was for­bid­den from breath­ing a word to users.

    Marco Ar­ment says “I’ve heard from mul­ti­ple sources that it effec­tively had a staff of zero for years”.↩︎

  3. Shih fur­ther writes on Quora:

    Let’s be clear that this has noth­ing to do with rev­enue vs op­er­at­ing costs. Reader never made money di­rectly (though you could maybe at­tribute some of Feed­burner and Ad­Sense for Feeds us­age to it), and it was­n’t the goal of the prod­uct. Reader has been fight­ing for ap­proval/­sur­vival at Google since long be­fore I was a PM for the prod­uct. I’m pretty sure Reader was threat­ened with de-staffing at least three times be­fore it ac­tu­ally hap­pened. It was often for some rea­son re­lated to so­cial:

    • 2008 - let’s pull the team off to build OpenSo­cial
    • 2009 - let’s pull the team off to build Buzz
    • 2010 - let’s pull the team off to build Google+

    It turns out they de­cided to kill it any­way in 2010, even though most of the en­gi­neers opted against join­ing G+. Iron­i­cal­ly, I think the rea­son Google al­ways wanted to pull the Reader team off to build these other so­cial prod­ucts was that the Reader team ac­tu­ally un­der­stood so­cial (and tried a lot of ex­per­i­ments over the years that in­formed the larger so­cial fea­tures at the com­pa­ny) [See Read­er’s friends im­ple­men­ta­tions v1, v2, and v3, com­ments, pri­vacy con­trols, and shar­ing fea­tures. Ac­tu­ally wait, you can’t see those any­more, since they were all ripped out­.]. Read­er’s so­cial fea­tures also evolved very or­gan­i­cally in re­sponse to users, in­stead of be­ing de­signed top-down like some of Google’s other efforts [Rob Fish­man’s Buz­zfeed ar­ti­cle has good cov­er­age of this: Google’s Lost So­cial Net­work]. I sus­pect that it sur­vived for some time after be­ing put into main­te­nance be­cause they be­lieved it could still be a use­ful source of con­tent into G+. Reader users were al­ways vo­ra­cious con­sumers of con­tent, and many of them fil­tered and shared a great deal of it. But after switch­ing the shar­ing fea­tures over to G+ (the so called “share-poca­lypse”) along with the re­designed UI, my guess is that us­age just started to fall - par­tic­u­larly around shar­ing. I know that my shar­ing ba­si­cally stopped com­pletely once the re­design hap­pened [Reader re­design: Ter­ri­ble de­ci­sion, or worst de­ci­sion? I was a lot an­grier then than I am now – now I’m just sad.]. Though Google did ul­ti­mately fix a lot of the UI is­sues, the shar­ing (and there­fore con­tent go­ing into G+) would never re­cov­er. So with dwin­dling use­ful­ness to G+, (like­ly) dwin­dling or flat­ten­ing us­age due to be­ing in main­te­nance, and Google’s big drive to fo­cus in the last cou­ple of years, what choice was there but to kill the pro­duct?

    ↩︎
  4. The sign story is con­firmed by an­other Googler; “Google Reader lived on bor­rowed time: cre­ator Chris Wetherell re­flects”:

    “When they re­placed shar­ing with +1 on Google Read­er, it was clear that this day was go­ing to come”, he said. Wetherell, 43, is amazed that Reader has lasted this long. Even be­fore the project saw the light of the day, Google ex­ec­u­tives were un­sure about the ser­vice and it was through sheer per­se­ver­ance that it squeaked out into the mar­ket. At one point, the man­age­ment team threat­ened to can­cel the project even be­fore it saw the light of the day, if there was a de­lay. “We had a sign that said, ‘days since can­cel­la­tion’ and it was there from the very be­gin­ning”, added a very san­guine Wetherell. My trans­la­tion: Google never re­ally be­lieved in the pro­ject. Google Reader started in 2005 at what was re­ally the golden age of RSS, blog­ging sys­tems and a new con­tent ecosys­tem. The big kahuna at that time was (ac­quired by ) and Google Reader was an up­start.

    ↩︎
  5. The offi­cial PR re­lease stated that too lit­tle us­age was the rea­son Reader was be­ing aban­doned. Whether this is the gen­uine rea­son has been ques­tioned by third par­ties, who ob­serve that Reader seems to drive far more traffic than an­other ser­vice which Google had yet to ax, Google+; that one app had >2m users who also had Reader ac­counts; that just one al­ter­na­tive to Reader (Feed­ly) had in ex­cess of 3 mil­lion signups post-an­nounce­ment (re­port­ed­ly, up to 4 mil­lion); and the largest of sev­eral pe­ti­tions to Google reached 148k sig­na­tures (less, though, than the >1m down­loads of the An­droid clien­t). Given that few users will sign up at Feedly specifi­cal­ly, sign a pe­ti­tion, visit the Buz­zFeed net­work, or use the apps in ques­tion, it seems likely that Reader had closer to 20m users than 2m users when its clo­sure was an­nounced. An un­known Google en­gi­neer has been quoted as say­ing in 2010 Reader had “tens of mil­lions ac­tive monthly users”. Xoogler Jenna Bilotta (left Google No­vem­ber 2011) said

    “I think the rea­son why peo­ple are freak­ing out about Reader is be­cause that Reader did stick,” she said, not­ing the wide­spread sur­prise that Google would shut down such a beloved prod­uct. “The num­bers, at least un­til I left, were still go­ing up.”

    The most pop­u­lar feed on Google Reader in March 2013 had 24.3m sub­scribers (some pix­el-count­ing of an offi­cial user-count graph & in­fer­ence from a leaked video sug­gests Reader in to­tal may’ve had >36m users in Jan 2011). Ja­son Scott in 2009 re­minded us that this lack of trans­parency is com­pletely pre­dictable: “Since the dawn of time, com­pa­nies have hired peo­ple whose en­tire job is to tell you every­thing is all right and you can com­pletely trust them and the com­pany is as sta­ble as a rock, and to do so un­til they, them­selves, are fired be­cause the com­pany is out of busi­ness.”↩︎

  6. This would not come as news to Ja­son Scott of , of course, but nev­er­the­less James Fal­lows points out that when a cloud ser­vice evap­o­rates, it’s sim­ply gone and gives an in­ter­est­ing com­par­ison:

    , in Mother Jones, on why the in­abil­ity to rely on Google ser­vices is more dis­rup­tive than the fa­mil­iar pre-cloud ex­pe­ri­ence of hav­ing fa­vorite pro­grams get or­phaned. My ex­am­ple is Lo­tus Agenda: it has offi­cially been dead for nearly 20 years, but I can still use it (if I want, in a DOS ses­sion un­der the VMware Fu­sion Win­dows em­u­la­tor on my Macs. Talk about lay­ered legacy sys­tem­s!). When a cloud pro­gram goes away, as Google Reader has done, it’s gone. There is no way you can keep us­ing your own “legacy” copy, as you could with pre­vi­ous or­phaned soft­ware.

    ↩︎
  7. From Gan­nes’s “An­other Rea­son Google Reader Died: In­creased Con­cern About Pri­vacy and Com­pli­ance”

    But at the same time, Google Reader was too deeply in­te­grated into Google Apps to spin it off and sell it, like the com­pany did last year with its SketchUp 3-D mod­el­ing soft­ware.

    mat­tbar­rie on Hacker News:

    I’m here with Alan No­ble who runs en­gi­neer­ing at Google Aus­tralia and ran the Google Reader project un­til 18 months ago. They looked at open sourc­ing it but it was too much effort to do so be­cause it’s tied to closely to Google in­fra­struc­ture. Ba­si­cally it’s been culled due to long term de­clin­ing use.

    ↩︎
  8. The sheer size & dom­i­nance of some Google ser­vices have lead to com­par­isons to nat­ural mo­nop­o­lies, such as the Econ­o­mist col­umn “Google’s Google prob­lem”. I saw this com­par­i­son mocked, but it’s worth not­ing that at least one Googler made the same com­par­i­son years be­fore. From ’s 2011, part 7, sec­tion 2:

    While some Googlers felt sin­gled out un­fairly for the at­ten­tion, the more mea­sured among them un­der­stood it as a nat­ural con­se­quence of Google’s in­creas­ing pow­er, es­pe­cially in re­gard to dis­trib­ut­ing and stor­ing mas­sive amounts of in­for­ma­tion. “It’s as if Google took over the wa­ter sup­ply for the en­tire United States”, says Mike Jones, who han­dled some of Google’s pol­icy is­sues. “It’s only fair that so­ci­ety slaps us around a lit­tle bit to make sure we’re do­ing the right thing.”

    ↩︎
  9. Specifically, this can be seen as a sort of issue of reducing deadweight loss: in some of the more successful acquisitions, Google’s modus operandi was to take a very expensive or highly premium service and make it completely free while also improving the quality. Analytics, Maps, Earth, Feedburner all come to mind as services whose predecessors (multiple, in the cases of Maps and Earth) charged money for their services (sometimes a great deal). This creates deadweight loss from the people who do not use them - people who would benefit to some degree, but not by the full amount of the price (plus other factors like the riskiness of investing time and money into trying them out). Google cites figures like billions of users over the years for several of these formerly-premium services, suggesting the gains from reduced deadweight loss are large.↩︎

  10. If there is one truth of the tech in­dus­try, it’s that no gi­ant (ex­cept IBM) sur­vives for­ev­er. Death rates for all cor­po­ra­tions and non­profits are very high, but par­tic­u­larly so for tech. One blog­ger asks a good ques­tion:

    As we come to rely more and more on the In­ter­net, it’s be­com­ing clear that there is a real threat posed by ty­ing one­self to a 3rd party ser­vice. The In­ter­net is fa­mously de­signed to route around fail­ures caused by a nu­clear strike - but it can­not de­fend against a ser­vice be­ing with­drawn or a com­pany go­ing bank­rupt. It’s tempt­ing to say that mul­ti­-bil­lion dol­lar com­pa­nies like Ap­ple and Google will never dis­ap­pear - but a quick look at his­tory shows Nokia, En­ron, Am­strad, Sega, and many more which have fallen from great heights un­til they are mere shells and no longer offer the ser­vices which many peo­ple once re­lied on…I like to pose this ques­tion to my pho­tog­ra­phy friends - “What would you do if Ya­hoo! sud­denly de­cided to delete all your Flickr pho­tos?” Some of them have back­ups - most faint at the thought of all their work van­ish­ing.

    ↩︎
  11. We­ber’s con­clu­sion:

    We dis­cov­ered there’s been a to­tal of about 251 in­de­pen­dent Google prod­ucts since 1998 (avoid­ing ad­d-on fea­tures and ex­per­i­ments that merged into other pro­ject­s), and found that 90, or ap­prox­i­mately 36% of them have been can­celed. Awe­some­ly, we also col­lected 8 ma­jor flops and 14 ma­jor suc­cess­es, which means that 36% of its high­-pro­file prod­ucts are fail­ures. That’s quite the co­in­ci­dence! NOTE: We did not ma­nip­u­late data to come to this con­clu­sion. It was a happy ac­ci­dent.

    In an even more happy ac­ci­dent, my dataset of 350 prod­ucts yields 123 can­celed/shut­down en­tries, or 35%!↩︎

  12. Some have jus­ti­fied Read­er’s shut­down as sim­ply a ra­tio­nal act, since Reader was not bring­ing in any money and Google is not a char­i­ty. The truth seems to be re­lated more to Google’s lack of in­ter­est since the start - it’s hard to see how Google could pos­si­bly be able to mon­e­tize Gmail and not also mon­e­tize Read­er, which is con­firmed by two in­volved Googlers (from “Google Reader lived on bor­rowed time: cre­ator Chris Wetherell re­flects”):

    I won­der, did the com­pany (Google) and the ecosys­tem at large mis­read the tea leaves? Did the world at large see an RSS/reader mar­ket when in re­al­ity the ac­tual mar­ket op­por­tu­nity was in data and sen­ti­ment analy­sis? [Chris] Wetherell agreed. “The reader mar­ket never went past the ex­per­i­men­tal phase and none was it­er­at­ing on the busi­ness mod­el,” he said. “Mon­e­ti­za­tion abil­i­ties were never tried.”

    “There was so much data we had and so much in­for­ma­tion about the affin­ity read­ers had with cer­tain con­tent that we al­ways felt there was mon­e­ti­za­tion op­por­tu­ni­ty,” he said. Dick Cos­tolo (cur­rently CEO of Twit­ter), who worked for Google at the time (hav­ing sold Google his com­pa­ny, Feed­burn­er), came up with many mon­e­ti­za­tion ideas but they fell on deaf ears. Cos­tolo, of course is work­ing hard to mine those affin­i­ty-and-con­text con­nec­tions for Twit­ter, and is suc­ceed­ing. What Cos­tolo un­der­stood, Google and its man­darins to­tally missed, as noted in this No­vem­ber 2011 blog post by Chris who wrote:

    Reader ex­hibits the best un­paid rep­re­sen­ta­tion I’ve yet seen of a con­sumer’s re­la­tion­ship to a con­tent pro­ducer. You pay for HBO? That’s a strong sig­nal. Con­sum­ing free stuff? Read­er’s model was a dream. Even bet­ter than Net­flix. You get affin­ity (which has clear mon­e­tary val­ue) for free, and a tracked pat­tern of be­hav­ior for the act of it­er­at­ing over differ­ently sourced items - and a mech­a­nism for dis­trib­ut­ing that quickly to an os­ten­si­ble au­di­ence which did­n’t in­clude so­cial guilt or gameifi­ca­tion - along with an ex­ten­si­ble, scal­able plat­form avail­able via com­monly used web tech­nolo­gies - all of which would be an amaz­ing op­por­tu­nity for the right prod­uct vi­sion­ary. Reader is (was?) for in­for­ma­tion junkies; not just tech nerds. This mar­ket to­tally ex­ists and is weirdly un­der­-served (and is pos­si­bly afflu­en­t).

    Over­all, from just the PR per­spec­tive, Google prob­a­bly would have been bet­ter off switch­ing Reader to a sub­scrip­tion model and then even­tu­ally killing it while claim­ing the fees weren’t cov­er­ing the costs. Offhand, 3 ex­am­ples of Google adding or in­creas­ing fees come to mind: the Maps API, Talk in­ter­na­tional calls (ap­par­ently free ini­tial­ly), and App En­gine fees; the API price in­crease was even­tu­ally re­scinded as far as I know, and no one re­mem­bers the lat­ter two (not even App En­gine de­vs).↩︎

  13. , “Free­dom 0”; iron­i­cal­ly, Pil­grim (hired by Google in 2007) seems to be re­spon­si­ble for at least one of the en­tries be­ing marked dead, Google’s Doc­type tech en­cy­clo­pe­dia, since it dis­ap­peared around the time of his “in­fo­s­ui­cide” and has not been res­ur­rected - it was only par­tially FLOSS.↩︎

  14. #18 in The Col­lected Songs of Cold Moun­tain, Red Pine 2000, ISBN 1-55659-140-3↩︎

  15. Xoogler Rachel Kroll on this spike:

    I have some thoughts about the spikes on the death dates.

    Sep­tem­ber: all of the in­terns go back to school. These peo­ple who ex­ist on the fringes of the sys­tem man­age to get a lot of work done, pos­si­bly be­cause they are free of most of the over­head fac­ing real em­ploy­ees. Once they leave, it’s up to the FTEs [Full Time Em­ploy­ee] to own what­ever was cre­at­ed, and that does­n’t al­ways work. I wish I could have kept some of them and swapped them for some of the ful­l-timers.

    March/April: An­nual bonus time? That’s what it used to be, at least, and I say this as some­one who quit in May, and that was no ac­ci­dent. Same thing: peo­ple leave, and that dooms what­ever they left.

    ↩︎
  16. 0.057; but as the old crit­i­cism of NHST goes, “surely God loves the 0.057 al­most as much as the 0.050”.↩︎

  17. Specifi­cal­ly: build­ing a lo­gis­tic model on a boot­strap sam­ple and then test­ing ac­cu­racy against full Google dataset.↩︎

  18. But note that “sun­set­ting” of “con­sumer Google+” was an­nounced in Oc­to­ber 2018.↩︎

  19. I in­clude Voice even though I don’t use it or oth­er­wise find it in­ter­est­ing (my cri­te­ria for the other 10) be­cause spec­u­la­tion has been rife and be­cause a pre­dic­tion on its fu­ture was re­quested.↩︎