The Iron Law Of Evaluation And Other Metallic Rules

Problems with social experiments and evaluating them, loopholes, causes, and suggestions; non-experimental methods systematically deliver false results, as most interventions fail or have small effects.
experiments, sociology, politics, causality, bibliography, insight-porn
by: Peter H. Rossi 2012-09-18–2019-05-13 finished certainty: log importance: 9

“The Iron Law Of Evaluation And Other Metallic Rules” is a classic review paper by American sociologist “Peter Rossi, a dedicated progressive and the nation’s leading expert on social program evaluation from the 1960s through the 1980s”; it discusses the difficulties of creating a useful social program, and proposes some aphoristic summary rules, including most famously:

  • The Iron Law: “The expected value of any net impact assessment of any large scale social program is zero.”
  • The Stainless Steel Law: “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.”

It expands an ear­lier paper by Rossi (“Issues in the eval­u­a­tion of human ser­vices deliv­ery”, Rossi 1978), where he coined the first, “Iron Law”.

I provide an annotated HTML version with fulltext for all references, as well as a bibliography collating many negative results in social experiments I’ve found since Rossi’s paper was published.

This tran­script has been pre­pared from an orig­i­nal scan; all hyper­links are my own inser­tion.

The Iron Law


by Peter Rossi


Eval­u­a­tions of social pro­grams have a long his­to­ry, as his­tory goes in the social sci­ences, but it has been only in the last two decades that eval­u­a­tion has come close to becom­ing a rou­tine activ­ity that is a func­tion­ing part of the pol­icy for­ma­tion process. Eval­u­a­tion research has become an activ­ity that no agency admin­is­ter­ing social pro­grams can do with­out and still retain a rep­u­ta­tion as mod­ern and up to date. In acad­e­mia, eval­u­a­tion research has infil­trated into most social sci­ence depart­ments as an inte­gral con­stituent of cur­ric­u­la. In short, eval­u­a­tion has become insti­tu­tion­al­ized.

There are many ben­e­fits to social pro­grams and to the social sci­ences from the insti­tu­tion­al­iza­tion of eval­u­a­tion research. Among the more impor­tant ben­e­fits has been a con­sid­er­able increase in knowl­edge con­cern­ing social prob­lems and about how social pro­grams work (and do not [pg4] work). Along with these ben­e­fits, how­ev­er, there have also been attached some loss­es. For those con­cerned with the improve­ment of the lot of dis­ad­van­taged per­sons, fam­i­lies and social groups, the result­ing knowl­edge has pro­vided the bases for both pes­simism and opti­mism. On the pes­simistic side, we have learned that design­ing suc­cess­ful pro­grams is a diffi­cult task that is not eas­ily or often accom­plished. On the opti­mistic side, we have learned more and more about the kinds of pro­grams that can be suc­cess­fully designed and imple­ment­ed. Knowl­edge derived from eval­u­a­tions is begin­ning to guide our judg­ments con­cern­ing what is fea­si­ble and how to reach those fea­si­ble goals.

To draw some impor­tant impli­ca­tions from this knowl­edge about the work­ings of social pro­grams is the objec­tive of this paper. The first step is to for­mu­late a set of “laws” that sum­ma­rize the major trends in eval­u­a­tion find­ings. Next, a set of expla­na­tions are pro­vided for those over­all find­ings. Final­ly, we explore the con­se­quences for applied social sci­ence activ­i­ties that flow from our new knowl­edge of social pro­grams.

Some “Laws” Of Evaluation

A dramatic but slightly overdrawn view of two decades of evaluation efforts can be stated as a set of “laws”, each summarizing some strong tendency that can be discerned in that body of materials. Following a 19th Century practice that has fallen into disuse in social science¹, these laws are named after substances of varying durability, roughly indexing each law’s robustness.

  • The Iron Law of Eval­u­a­tion: “The expected value of any net impact assess­ment of any large scale social pro­gram is zero.”

    The Iron Law arises from the experience that few impact assessments of large scale² social programs have found that the programs in question had any net impact. The law also means that, based on the evaluation efforts of the last twenty years, the best a priori estimate of the net impact assessment of any program is zero, i.e., that the program will have no effect.

  • The Stain­less Steel Law of Eval­u­a­tion: “The bet­ter designed the impact assess­ment of a social pro­gram, the more likely is the result­ing esti­mate of net impact to be zero.”

    This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero, or no effect. Specifically, this law implies that estimating net impacts through randomized experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches. [pg5]

  • The Brass Law of Eval­u­a­tion: “The more social pro­grams are designed to change indi­vid­u­als, the more likely the net impact of the pro­gram will be zero.”

    This law means that social pro­grams designed to reha­bil­i­tate indi­vid­u­als by chang­ing them in some way or another are more likely to fail. The Brass Law may appear to be redun­dant since all pro­grams, includ­ing those designed to deal with indi­vid­u­als, are cov­ered by the Iron Law. This redun­dancy is intended to empha­size the espe­cially diffi­cult task in design­ing and imple­ment­ing effec­tive pro­grams that are designed to reha­bil­i­tate indi­vid­u­als.

  • The Zinc Law of Eval­u­a­tion: “Only those pro­grams that are likely to fail are eval­u­at­ed.”

    Of the sev­eral metal­lic laws of eval­u­a­tion, the zinc law has the most opti­mistic slant since it implies that there are effec­tive pro­grams but that such effec­tive pro­grams are never eval­u­at­ed. It also implies that if a social pro­gram is effec­tive, that char­ac­ter­is­tic is obvi­ous enough and hence pol­icy mak­ers and oth­ers who spon­sor and fund eval­u­a­tions decide against eval­u­a­tion.

It is possible to formulate a number of additional laws of evaluation, each attached to one or another of a variety of substances, varying in strength from strong, robust metals to flimsy materials. The substances involved are only limited by one’s imagination. But, if such laws are to mirror the major findings of the last two decades of evaluation research they would all carry the same message: The laws would claim that a review of the history of the last two decades of efforts to evaluate major social programs in the United States sustains the proposition that over this period the American establishment of policy makers, agency officials, professionals and social scientists did not know how to design and implement social programs that were minimally effective, let alone spectacularly so.

How Firm Are The Metallic Laws Of Evaluation?

How seri­ously should we take the metal­lic laws? Are they sim­ply the social sci­ence ana­logue of poetic license, intended to pro­vide dra­matic empha­sis? Or, do the laws accu­rately sum­ma­rize the last two decades’ eval­u­a­tion expe­ri­ences?

First of all, viewed against the evidence, the iron law is not entirely rigid. True, most impact assessments conform to the iron law’s dictates in showing at best marginal effects and all too often no effects at all. There are even a few evaluations that have shown effects in the wrong directions, [pg6] opposite to the desired effects. Some of the failures of large scale programs have been particularly disappointing because of the large investments of time and resources involved: Manpower retraining programs have not been shown to improve earnings or employment prospects of participants (Westat, 1976–1980). Most of the attempts to rehabilitate prisoners have failed to reduce recidivism (Lipton, Martinson, and Wilks, 1975). Most educational innovations have not been shown to improve student learning appreciably over traditional methods (Raizen and Rossi, 1981).

But, there are also many exceptions to the iron rule! The “iron” in the Iron Law has shown itself to be somewhat spongy and therefore easily, although not frequently, broken. Some social programs have shown positive effects in the desired directions, and there are even some quite spectacular successes: the American old age pension system plus Medicare has dramatically improved the lives of our older citizens. Medicaid has managed to deliver medical services to the poor to the extent that the negative correlation between income and consumption of medical services has declined dramatically since enactment. The family planning clinics subsidized by the federal government were effective in reducing the number of births in areas where they were implemented (Cutright and Jaffe, 1977). There are also human services programs that have been shown to be effective, although mainly on small scale, pilot runs: for example, the Minneapolis experiment on the police handling of family violence showed that if the police placed the offending abuser in custody overnight, the offender was less likely to show up as an accused offender over the succeeding six months (Sherman and Berk, 1984)³. A meta-evaluation of psychotherapy showed that on the average, persons in psychotherapy (no matter what brand) were a third of a standard deviation improved over control groups that did not have any therapy (Smith, Glass, and Miller, 1980). In most of the evaluations of manpower training programs, women returning to the labor force benefited positively compared to women who did not take the courses, even though in general such programs have not been successful. Even is now beginning to show some positive benefits after many years of equivocal findings. And so it goes on, through a relatively long list of successful programs.
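
To give a rough sense of what "a third of a standard deviation" means in practice, the sketch below converts that effect size into the share of the untreated group that the average treated person would outperform. It assumes normally distributed outcomes with equal spread in both groups, which the paper does not state; the function name is purely illustrative.

```python
from statistics import NormalDist


def share_of_controls_outperformed(effect_size_sd: float) -> float:
    """Fraction of the control group scoring below the average treated person.

    Assumption (not from the paper): outcomes are normal with the same
    standard deviation in both groups, so the answer is simply the standard
    normal CDF evaluated at the effect size.
    """
    return NormalDist().cdf(effect_size_sd)


if __name__ == "__main__":
    share = share_of_controls_outperformed(1 / 3)
    print(f"Average treated person outperforms about {share:.0%} of controls")
```

Under those assumptions the figure comes out at roughly 63%, against 50% for a program with no effect at all: a real but modest shift.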

But even in the case of suc­cess­ful social pro­grams, the sizes of the net effects have not been spec­tac­u­lar. In the social pro­gram field, noth­ing has yet been invented which is as effec­tive in its way as the was for the field of pub­lic health. In short, as is well known (and widely deplored) we are not on the verge of wip­ing out the social scourges of our time: igno­rance, pover­ty, crime, depen­den­cy, or men­tal ill­ness show great promise to be with us for some time to come.

The Stainless Steel Law appears to be more likely to hold up over a [pg7] large series of cases than the more general Iron Law. This is because the fiercest competition as an explanation for the seeming success of any program (especially human services programs) ordinarily is either self- or administrator-selection of clients. In other words, if one finds that a program appears to be effective, the most likely alternative explanation to judging the program as the cause of that success is that the persons attracted to that program were likely to get better on their own or that the administrators of that program chose those who were already on the road to recovery as clients. As the better research designs, particularly randomized experiments, eliminate that competition, the less likely is a program to show any positive net effect. So the better the research design, the more likely the net impact assessment is to be zero.
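
The selection argument can be illustrated with a small simulation. This is only a sketch under invented assumptions: the program's true effect is set to zero, each person has a latent "prognosis" (how much they would improve on their own), and people with a good prognosis enroll with probability 0.8 versus 0.2 for everyone else. None of these numbers come from the paper.

```python
import random
from statistics import fmean


def simulate_designs(n=20000, true_effect=0.0, seed=0):
    """Return (self-selected estimate, randomized estimate) of program impact."""
    rng = random.Random(seed)
    # Latent prognosis: how much each person would improve with no program.
    prognosis = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def outcome(p, treated):
        # Observed outcome = own prognosis + program effect (if treated) + noise.
        return p + (true_effect if treated else 0.0) + rng.gauss(0.0, 1.0)

    def difference_in_means(is_treated):
        treated, control = [], []
        for p in prognosis:
            t = is_treated(p)
            (treated if t else control).append(outcome(p, t))
        return fmean(treated) - fmean(control)

    # Design A: self- or administrator-selection. People already likely to
    # improve enroll with probability 0.8, everyone else with probability 0.2.
    selected = difference_in_means(
        lambda p: rng.random() < (0.8 if p > 0 else 0.2)
    )
    # Design B: a coin flip decides who is treated, independent of prognosis.
    randomized = difference_in_means(lambda p: rng.random() < 0.5)
    return selected, randomized


if __name__ == "__main__":
    selected, randomized = simulate_designs()
    print(f"self-selected comparison: {selected:+.2f}")
    print(f"randomized comparison:    {randomized:+.2f}")
```

With a true effect of zero, the self-selected comparison still reports a large positive "impact" because enrollees were on the road to recovery anyway, while the randomized comparison stays close to zero.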

How about the Zinc Law of Eval­u­a­tion? First, it should be pointed out that this law is impos­si­ble to ver­ify in any lit­eral sense. The only way that one can be rel­a­tively cer­tain that a pro­gram is effec­tive is to eval­u­ate it, and hence the propo­si­tion that only ineffec­tive pro­grams are eval­u­ated can never be proven.

However, there is a sense in which the Zinc Law is correct. If the a priori, beyond-any-doubt expectation of decision makers and agency heads is that a program will be effective, there is little chance that the program will be evaluated at all. Our most successful social program, social security payments to the aged, has never been evaluated in a rigorous sense. It is “well known” that the program manages to raise the incomes of retired persons and their families, and “it stands to reason” that this increase in income is greater than what would have happened, absent the social security system.

Eval­u­a­tion research is the legit­i­mate child of skep­ti­cism, and where there is faith, research is not called upon to make a judg­ment. Indeed, the his­tory of the income main­te­nance exper­i­ments bears this point out. Those exper­i­ments were not under­taken to find out whether the main pur­pose of the pro­posed pro­gram could be achieved: that is, no one doubted that pay­ments would pro­vide income to poor peo­ple—in­deed, pay­ments by defi­n­i­tion are income, and even social sci­en­tists are not inclined to waste resources inves­ti­gat­ing tau­tolo­gies. Fur­ther­more, no one doubted that pay­ments could be cal­cu­lated and checks could be deliv­ered to house­holds. The main pur­pose of the exper­i­ment was to esti­mate the sizes of cer­tain antic­i­pated side effects of the pay­ments, about which econ­o­mists and pol­icy mak­ers were uncer­tain—how much of a work dis­in­cen­tive effect would be gen­er­ated by the pay­ments and whether the pay­ments would affect other aspects of the house­holds in unde­sir­able ways—­for instance, increas­ing the divorce rate among par­tic­i­pants.

In short, when we look at the evidence for the metallic laws, the evidence appears not to sustain their seemingly rigid character, but the [pg8] evidence does sustain the “laws” as statistical regularities. Why this should be the case is the topic to be explored in the remainder of this paper.

Is There Something Wrong With Evaluation Research?

A possibility that deserves very serious consideration is that there is something radically wrong with the ways in which we go about conducting evaluations. Indeed, this argument is the foundation of a revisionist school of evaluation, composed of evaluators who are intent on calling into question the main body of methodological procedures used in evaluation research, especially those that emphasize quantitative and particularly experimental approaches to the estimation of net impacts. The revisionists include such persons as Michael Patton (1980) and Egon Guba (1981). Some of the revisionists are reformed number crunchers who have seen the errors of their ways and have been reborn as qualitative researchers. Others have come from social science disciplines in which qualitative ethnographic field methods have been dominant.

Although the issue of the appropriateness of social science methodology is an important one, so far the revisionist arguments fall far short of being fully convincing. At the root of the revisionist argument appears to be that the revisionists find it difficult to accept the findings that most social programs, when evaluated for impact assessment by rigorous quantitative evaluation procedures, fail to register main effects: hence the defects must be in the method of making the estimates.⁴ This argument per se is an interesting one, and deserves attention: all procedures need to be continually re-evaluated. There are some obvious deficiencies in most evaluations, some of which are inherent in the procedures employed. For example, a program that is constantly changing and evolving cannot ordinarily be rigorously evaluated since the treatment to be evaluated cannot be clearly defined. Such programs either require new evaluation procedures or should not be evaluated at all.

The weak­ness of the revi­sion­ist approaches lies in their pro­posed solu­tions to these defi­cien­cies. Crit­i­ciz­ing quan­ti­ta­tive approaches for their wood­en­ness and inflex­i­bil­i­ty, they pro­pose to replace cur­rent meth­ods with pro­ce­dures that have even greater and more obvi­ous defi­cien­cies. The qual­i­ta­tive pro­ce­dures they pro­pose are not exempt from issues of inter­nal and exter­nal valid­ity and ordi­nar­ily do not attempt to address these thorny prob­lems. Indeed, the pro­ce­dures which they advance as sub­sti­tutes for the main­stream method­ol­ogy are usu­ally vaguely described, [pg9] con­sti­tut­ing an almost mys­ti­cal advo­cacy of the virtues of qual­i­ta­tive approach­es, with­out clear dis­cus­sion of the spe­cific ways in which such pro­ce­dures meet valid­ity cri­te­ria. In addi­tion, many appear to adopt pro­gram oper­a­tor per­spec­tives on effec­tive­ness, rea­son­ing that any effort to improve social con­di­tions must have some effect, with the bur­den of proof placed on the eval­u­a­tion researcher to find out what those effects might be.

Although many of their argu­ments con­cern­ing the wood­en­ness of many quan­ti­ta­tive researches are cogent and well tak­en, the main revi­sion­ist argu­ments for an alter­na­tive method­ol­ogy are uncon­vinc­ing: hence one must look else­where than to eval­u­a­tion method­ol­ogy for the rea­sons for the fail­ure of social pro­grams to pass muster before the bar of impact assess­ments.

Sources Of Program Failures

Start­ing with the con­vic­tion that the many find­ings of zero impact are real, we are led inex­orably to the con­clu­sion that the faults must lie in the pro­grams. Three kinds of fail­ure can be iden­ti­fied, each a major source of the observed lack of impact:

The first two types of faults that lead a pro­gram to fail stem from prob­lems in social sci­ence the­ory and the third is a prob­lem in the orga­ni­za­tion of social pro­grams:

  1. Faults in Prob­lem the­ory: The pro­gram is built upon a faulty under­stand­ing of the social processes that give rise to the prob­lem to which the social pro­gram is osten­si­bly addressed;
  2. Faults in Pro­gram the­ory: The pro­gram is built upon a faulty under­stand­ing of how to trans­late prob­lem the­ory into spe­cific pro­grams.
  3. Faults in Program Implementation: There are faults in the organizations, resource levels and/or activities that are used to deliver the program to its intended beneficiaries.

Note that the term the­ory is used above in a fairly loose way to cover all sorts of empir­i­cally grounded gen­er­al­ized knowl­edge about a top­ic, and is not lim­ited to for­mal propo­si­tions.

Every social pro­gram, implic­itly or explic­it­ly, is based on some under­stand­ing of the social prob­lem involved and some under­stand­ing of the pro­gram. If one fails to arrive at an appro­pri­ate under­stand­ing of either, the pro­gram in ques­tion will undoubt­edly fail. In addi­tion, every pro­gram [pg10] is given to some orga­ni­za­tion to imple­ment. Fail­ures to pro­vide enough resources, or to insure that the pro­gram is deliv­ered with suffi­cient fidelity can also lead to find­ings of ineffec­tive­ness.

Problem Theory

Problem theory consists of the body of empirically tested understanding of the social problem that underlies the design of the program in question. For example, the problem theory that was the underpinning for the many attempts at prisoner rehabilitation tried in the last two decades was that criminality was a personality disorder. Even though there was a lot of evidence for this viewpoint, it also turned out that the theory is not relevant either to understanding crime rates or to the design of crime policy. The changes in crime rates do not reflect massive shifts in personality characteristics of the American population, nor does the personality disorder theory of crime lead to clear implications for crime reduction policies. Indeed, it is likely that large scale personality changes are beyond the reach of social policy institutions in a democratic society.

The adop­tion of this the­ory is quite under­stand­able. For exam­ple, how else do we account for the fact that per­sons seem­ingly exposed to the same influ­ences do not show the same crim­i­nal (or non­crim­i­nal) ten­den­cies? But the the­ory is not use­ful for under­stand­ing the social dis­tri­b­u­tion of crime rates by gen­der, socio-e­co­nomic lev­el, or by age.

Program Theory

Program theory links together the activities that constitute a social program and desired program outcomes. Obviously, program theory is also linked to problem theory, but is partially independent. For example, given the problem theory that diagnosed criminality as a personality disorder, a matching program theory would have as its aim personality-change-oriented therapy. But there are many specific ways in which therapy can be defined and at many different points in the history of individuals. At the one extreme of the lifeline, one might attempt preventive mental health work directed toward young children; at the other extreme, one might provide psychiatric treatment for prisoners or set up therapeutic groups in prison for convicted offenders.


Program Implementation

The third major source of failure is organizational in character and has to do with the failure to implement programs properly. Human services [pg11] programs are notoriously difficult to deliver appropriately to the appropriate clients. A well designed program that is based on correct problem and program theories may simply be implemented improperly, including not implementing any program at all. Indeed, in the early days of the , many examples were found of non-programs: the failure to implement anything at all.

Note that these three sources of fail­ure are nested to some degree:

  1. An incor­rect under­stand­ing of the social prob­lem being addressed is clearly a major fail­ure that inval­i­dates a cor­rect pro­gram the­ory and an excel­lent imple­men­ta­tion.
  2. No mat­ter how good the prob­lem the­ory may be, an inap­pro­pri­ate pro­gram the­ory will lead to fail­ure.
  3. And, no mat­ter how good the prob­lem and pro­gram the­o­ries, a poor imple­men­ta­tion will also lead to fail­ure.

Sources of Theory Failure

A major reason for failures produced through incorrect problem and program theories lies in the serious under-development of policy related social science theories in many of the basic disciplines. The major problem with much basic social science is that social scientists have tended to ignore policy related variables in building theories because policy related variables account for so little of the variance in the behavior in question. It does not help the construction of social policy any to know that a major determinant of criminality is age, because there is little, if anything, that policy can do about the age distribution of a population, given a commitment to our current democratic, liberal values. There are notable exceptions to this generalization about social science: economics and political science have always been closely attentive to policy considerations; this indictment concerns mainly such fields as sociology, anthropology and psychology.

Inci­den­tal­ly, this gen­er­al­iza­tion about social sci­ence and social sci­en­tists should warn us not to expect too much from changes in social pol­i­cy. This impli­ca­tion is quite impor­tant and will be taken up later on in this paper.

But the major reason why programs fail through failures in problem and program theories is that the designers of programs are ordinarily amateurs who know even less than the social scientists! There are numerous examples of social programs that were concocted by well meaning amateurs (but amateurs nevertheless). A prime example are , an invention of the , apparently [pg12] undertaken without any input from the , the agency that was given the mandate to administer the program. Similarly with the Comprehensive Employment and Training Act (CETA) and its successor, the current Job Training Partnership Act (JTPA) program, both of which were designed by rank amateurs and then given over to the Department of Labor to run and administer. Of course, some of the amateurs were advised by social scientists about the programs in question, so the social scientists are not completely blameless.

The ama­teurs in ques­tion are the leg­is­la­tors, judi­cial offi­cials, and other pol­icy mak­ers who ini­ti­ate pol­icy and pro­gram changes. The main prob­lem with ama­teurs lies not so much in their ama­teur sta­tus but in the fact that they may know lit­tle or noth­ing about the prob­lem in ques­tion or about the pro­grams they design. Social sci­ence may not be an extra­or­di­nar­ily well devel­oped set of dis­ci­plines, but social sci­en­tists do know some­thing about our soci­ety and how it works, knowl­edge that can prove use­ful in the design of pol­icy and pro­grams that may have a chance to be suc­cess­fully effec­tive.

Our social programs seemingly are designed by procedures that lie somewhere in between setting monkeys to typing mindlessly on typewriters in the hope that additional Shakespearean plays will eventually be produced, and Edisonian trial-and-error procedures in which one tactic after another is tried in the hope of finding out some method that works. Although the Edisonian paradigm is not highly regarded as a scientific strategy by the philosophers of science, there is much to recommend it in a historical period in which good theory is yet to develop. It is also a strategy that allows one to learn from errors. Indeed, evaluation is very much a part of an Edisonian strategy of starting new programs, and attempting to learn from each trial.⁵

Problem Theory Failures

One of the more persistent failures in problem theory is to under-estimate the complexity of the social world. Most of the social problems with which we deal are generated by very complex causal processes involving interactions of a very complex sort among societal level, community level, and individual level processes. In all likelihood there are biological level processes involved as well, however much our liberal ideology is repelled by the idea. The consequence of under-estimating the complexity of the problem is often to over-estimate our abilities to affect the amount and course of the problem. This means that we are overly optimistic about how much of an effect even the best of social programs can expect to achieve. It [pg13] also means that we under-design our evaluations, running the risk of committing Type II error: that is, not having enough statistical power in our evaluation research designs to be able to detect reliably those small effects that we are likely to encounter.
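
The point about under-designed evaluations can be made concrete with a textbook power calculation. The sketch below uses the normal approximation for a two-sided, two-sample comparison of means; the effect sizes, the 5% significance level and the 80% power target are conventional illustrative choices, not figures from the paper, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def power_two_sample(effect_size_sd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate chance of detecting a true effect (two-sided z-test).

    effect_size_sd is the true difference in means, in standard deviations.
    """
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    # The standardized difference in means has standard error sqrt(2 / n).
    shift = effect_size_sd * math.sqrt(n_per_group / 2)
    return (1 - _Z.cdf(z_crit - shift)) + _Z.cdf(-z_crit - shift)


def n_per_group_for_power(effect_size_sd: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate sample size per group needed to reach the target power."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / effect_size_sd) ** 2)


if __name__ == "__main__":
    for d in (0.5, 0.2, 0.1):
        print(f"effect {d} SD: power with 100 per group = "
              f"{power_two_sample(d, 100):.2f}, "
              f"n per group for 80% power = {n_per_group_for_power(d)}")
```

Under these assumptions, an evaluation with 100 people per group would miss a true effect of a tenth of a standard deviation most of the time, and well over a thousand per group would be needed to detect it reliably.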

It is instructive to consider the example of crime. In the last two decades, we have learned a great deal about the crime problem through our attempts to halt the rising crime rate in our society by initiating one social program after another. This series of trials has largely failed to have [substantial] impacts on the crime rates. The research effort has yielded a great deal of empirical knowledge about crime and criminals. For example, we now know a great deal about the demographic characteristics of criminals and their victims. But, we still have only the vaguest ideas about why the crime rates rose so steeply in the period between 1970 and 1980 and, in the last few years, have started what appears to be a gradual decline. We have also learned that the criminal justice system has been given an impossible task to perform and, indeed, practices a wholesale form of deception in which everyone acquiesces. It has been found that most perpetrators of most criminal acts go undetected, when detected go unprosecuted, and when prosecuted go unpunished. Furthermore, most prosecuted and sentenced criminals are dealt with by plea bargaining procedures that are just in the last decade getting formal recognition as occurring at all. After decades of sub rosa existence, plea bargaining is beginning to get official recognition in the criminal code and judicial interpretations of that code.

But most of what we have learned in the past two decades amounts to a bet­ter descrip­tion of the crime prob­lem and the crim­i­nal jus­tice sys­tem as it presently func­tions. There is sim­ply no doubt about the impor­tance of this detailed infor­ma­tion: it is going to be the foun­da­tion of our under­stand­ing of crime; but, it is not yet the basis upon which to build poli­cies and pro­grams that can lessen the bur­den of crime in our soci­ety.

Per­haps the most impor­tant les­son learned from the descrip­tive and eval­u­a­tive researches of the past two decades is that crime and crim­i­nals appear to be rel­a­tively insen­si­tive to the range of pol­icy and pro­gram changes that have been eval­u­ated in this peri­od. This means that the prospects for sub­stan­tial improve­ments in the crime prob­lem appear to be slight, unless we gain bet­ter the­o­ret­i­cal under­stand­ing of crime and crim­i­nals. That is why the Iron Law of Eval­u­a­tion appears to be an excel­lent gen­er­al­iza­tion for the field of social pro­grams aimed at reduc­ing crime and lead­ing crim­i­nals to the straight and nar­row way of life. The knowl­edge base for devel­op­ing effec­tive crime poli­cies and pro­grams sim­ply does not exist; and hence in this field, we are con­demned—hope­fully tem­porar­i­ly—to Edis­on­ian trial and error.


Program Theory And Implementation Failures

As defined ear­lier, pro­gram the­ory fail­ures are trans­la­tions of a proper under­stand­ing of a prob­lem into inap­pro­pri­ate pro­grams, and pro­gram imple­men­ta­tion fail­ures arise out of defects in the deliv­ery sys­tem used. Although in prin­ci­ple it is pos­si­ble to dis­tin­guish pro­gram the­ory fail­ures from pro­gram imple­men­ta­tion fail­ures, in prac­tice it is diffi­cult to do so. For exam­ple, a cor­rect pro­gram may be incor­rectly deliv­ered, and hence would con­sti­tute a “pure” exam­ple of imple­men­ta­tion fail­ure, but it would be diffi­cult to iden­tify this case as such, unless there were some instances of cor­rect deliv­ery. Hence both pro­gram the­ory and pro­gram imple­men­ta­tion fail­ures will be dis­cussed together in this sec­tion.

These kinds of fail­ures are likely the most com­mon causes of ineffec­tive pro­grams in many fields. There are many ways in which pro­gram the­ory and pro­gram imple­men­ta­tion fail­ures can occur. Some of the more com­mon ways are listed below.

Wrong Treatment

This occurs when the treat­ment is sim­ply a seri­ously flawed trans­la­tion of the prob­lem the­ory into a pro­gram. One of the best exam­ples is the hous­ing allowance exper­i­ment in which the exper­i­menters attempted to moti­vate poor house­holds to move into higher qual­ity hous­ing by offer­ing them a rent sub­sidy, con­tin­gent on their mov­ing into hous­ing that met cer­tain qual­ity stan­dards (Struyk and Ben­dick, 1981). The exper­i­menters found that only a small por­tion of the poor house­holds to whom this offer was made actu­ally moved to bet­ter hous­ing and thereby qual­i­fied for and received hous­ing sub­sidy pay­ments. After much econo­met­ric cal­cu­la­tion, this unex­pected out­come was found to have been appar­ently gen­er­ated by the fact that the exper­i­menters unfor­tu­nately did not take into account that the costs of mov­ing were far from zero. When the antic­i­pated dol­lar ben­e­fits from the sub­sidy were com­pared to the net ben­e­fits, after tak­ing into account the costs of mov­ing, the net ben­e­fits were in a very large pro­por­tion of the cases uncom­fort­ably close to zero and in some instances neg­a­tive. Fur­ther­more, the hous­ing stan­dards applied almost totally missed the point. They were tech­ni­cal stan­dards that often char­ac­ter­ized hous­ing as sub­-s­tan­dard that was quite accept­able to the house­holds involved. In other words, these were stan­dards that were regarded as irrel­e­vant by the clients. It was unrea­son­able to assume that house­holds would under­take to move when there was no push of dis­sat­is­fac­tion from the hous­ing occu­pied and no sub­stan­tial net pos­i­tive ben­e­fit in dol­lar [pg15] terms for doing so. 
Inci­den­tal­ly, the fact that poor fam­i­lies with lit­tle for­mal edu­ca­tion were able to make deci­sions that were con­sis­tent with the out­comes of highly tech­ni­cal econo­met­ric cal­cu­la­tions improves one’s appre­ci­a­tion of the innate intel­lec­tual abil­i­ties of that pop­u­la­tion.
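The arithmetic behind this outcome can be sketched in a few lines of Python. This is an illustrative sketch only: the dollar figures, the function names `net_benefit` and `would_move`, and the simple decision rule are all assumptions invented for the example, not values or methods taken from Struyk and Bendick (1981).

```python
def net_benefit(monthly_subsidy, months_of_subsidy, moving_cost):
    """Dollar gain from moving to qualify: total subsidy minus the cost of moving."""
    return monthly_subsidy * months_of_subsidy - moving_cost


def would_move(monthly_subsidy, months_of_subsidy, moving_cost, dissatisfaction=0):
    """Hypothetical decision rule: move only if the net dollar benefit, plus any
    dollar-equivalent 'push' of dissatisfaction with current housing, is positive."""
    total = net_benefit(monthly_subsidy, months_of_subsidy, moving_cost) + dissatisfaction
    return total > 0


# Hypothetical households; every figure here is invented for illustration.
households = [
    {"name": "A", "monthly_subsidy": 40, "months_of_subsidy": 24, "moving_cost": 900},
    {"name": "B", "monthly_subsidy": 40, "months_of_subsidy": 24, "moving_cost": 1000},
    {"name": "C", "monthly_subsidy": 60, "months_of_subsidy": 24, "moving_cost": 700},
]

for h in households:
    gain = net_benefit(h["monthly_subsidy"], h["months_of_subsidy"], h["moving_cost"])
    moves = would_move(h["monthly_subsidy"], h["months_of_subsidy"], h["moving_cost"])
    print(f"household {h['name']}: net benefit ${gain}, moves: {moves}")
```

Under these made-up numbers, a household whose moving costs roughly equal the total subsidy gains little or nothing from moving, and with no dissatisfaction to push it out of its current housing, staying put is the sensible choice.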

Right Treatment But Insufficient Dosage

A very recent set of trial polic­ing pro­grams in Hous­ton, Texas and Newark, New Jer­sey exem­pli­fies how pro­grams may fail not so much because they were admin­is­ter­ing the wrong treat­ment but because the treat­ment was frail and puny (Po­lice Foun­da­tion, 1985). Part of the goals of the pro­gram was to pro­duce a more pos­i­tive eval­u­a­tion of local police depart­ments in the views of local res­i­dents. Sev­eral differ­ent treat­ments were attempt­ed. In Hous­ton, the police attempted to meet the pre­sumed needs of vic­tims of crime by hav­ing a police offi­cer call them up a week or so after a crime com­plaint was received to ask “how they were doing” and to offer help in “any way”. Over a period of a year, the police man­aged to con­tact about 230 vic­tims, but the help they could offer con­sisted mainly of refer­rals to other agen­cies. Fur­ther­more, the crimes in ques­tion were mainly prop­erty thefts with­out per­sonal con­tact between vic­tims and offend­ers, with the main request for aid being requests to speed up the return of their stolen prop­er­ty. Any­one who knows even a lit­tle bit about prop­erty crime in the United States would know that the police do lit­tle or noth­ing to recover stolen prop­erty mainly because there is no way they can do so. Since the callers from the police depart­ment could not offer any sub­stan­tial aid to rem­edy the prob­lems caused by the crimes in ques­tion, the treat­ment deliv­ered by the pro­gram was essen­tially zero. It goes with­out say­ing that those con­tacted by the police offi­cers did not differ from ran­domly selected con­trol­s—who had also been vic­tim­ized but who had not been called by the police—in their eval­u­a­tion of the Hous­ton Police Depart­ment.

It seems likely that the treat­ment admin­is­tered, namely expres­sions of con­cern for the vic­tims of crime, admin­is­tered in a per­sonal face-to-face way, would have been effec­tive if the police could have offered sub­stan­tial help to the vic­tims.

Counter-acting Delivery System

It is obvi­ous that any pro­gram con­sists not only of the treat­ment intended to be deliv­ered, but it also con­sists of the deliv­ery sys­tem and what­ever is done to clients in the deliv­ery of ser­vices. Thus the income main­te­nance exper­i­ments’ treat­ments con­sist not only of the pay­ments, but the entire sys­tem of monthly income reports required of the clients, [pg16] the quar­terly inter­views and the annual income reviews, as well as the pay­ment sys­tem and its rules. In that par­tic­u­lar case, it is likely that the pay­ments dom­i­nated the pay­ment sys­tem, but in other cases that might not be so, with the deliv­ery sys­tem pro­foundly alter­ing the impact of the treat­ment.

Per­haps the most egre­gious exam­ple was the group coun­sel­ing pro­gram run in Cal­i­for­nia pris­ons dur­ing the 1960s (Kasse­baum, Ward, and Wilner, 1972). Guards and other prison employ­ees were used as coun­sel­ing group lead­ers, in ses­sions in which all par­tic­i­pants—pris­on­ers and guard­s—were asked to be frank and can­did with each oth­er! There are many rea­sons for the abysmal fail­ure6 of this pro­gram to affect either crim­i­nals’ behav­ior within prison or dur­ing their sub­se­quent period of parole, but among the lead­ing con­tenders for the role of vil­lain was the prison sys­tem’s use of guards as ther­a­pists.

Another exam­ple is the fail­ure of tran­si­tional aid pay­ments to released pris­on­ers when the pay­ment sys­tem was run by the state employ­ment secu­rity agen­cy, in con­trast to the strong pos­i­tive effect found when run by researchers (Rossi, Berk, and Leni­han, 1980). In a ran­dom­ized exper­i­ment run by social researchers in Bal­ti­more, the pro­vi­sion of 3 months of min­i­mal sup­port pay­ments low­ered the re-ar­rest rate by 8 per­cent, a small decre­ment, but a [sta­tis­ti­cal­ly]-sig­nifi­cant one that was cal­cu­lated to have very high cost to ben­e­fit ratios. When the Depart­ment of Labor wisely decided that another ran­dom­ized exper­i­ment should be run to see whether YOAA—“Your Ordi­nary Amer­i­can Agency”—could achieve the same results, large scale exper­i­ments in Texas and Geor­gia showed that putting the treat­ment in the hands of the employ­ment secu­rity agen­cies in those two states can­celed the pos­i­tive effects of the treat­ment. The pro­ce­dure which pro­duced the fail­ure was a sim­ple one: the pay­ments were made con­tin­gent on being unem­ployed, as the employ­ment secu­rity agen­cies usu­ally admin­is­tered unem­ploy­ment ben­e­fits, cre­at­ing a strong work dis­in­cen­tive effect with the unfor­tu­nate con­se­quence of a longer period of unem­ploy­ment for exper­i­men­tals as com­pared to their ran­dom­ized con­trols and hence a higher than expected re-ar­rest rate.

Pilot and Production Runs

The last example can be subsumed under a more general point: the fact that a treatment is effective in a pilot test does not mean that its effectiveness can be maintained when it is turned over to YOAA. This is the lesson to be derived from the transitional aid experiments in Texas and Georgia and from programs such as Project Follow Through.7 In the latter program leading teaching specialists were asked to develop versions of their teaching methods to be implemented in actual [pg17] school systems. Despite generous support and willing cooperation from their schools, the researchers were unable to get workable versions of their teaching strategies into place until at least a year into the running of the program. There is a big difference between running a program on a small scale with highly skilled and very devoted personnel and running a program with the lesser skilled and less devoted personnel that YOAA ordinarily has at its disposal. Programs that appear to be very promising when run by the persons who developed them often turn out to be disappointments when turned over to line agencies.

Inadequate Reward System

The inter­nally defined reward sys­tem of an orga­ni­za­tion has a strong effect on what activ­i­ties are assid­u­ously pur­sued and those that are char­ac­ter­ized by “benign neglect”. The fact that an agency is directed to engage in some activ­ity does not mean that it will do so unless the reward sys­tem within that orga­ni­za­tion actively fos­ters com­pli­ance. Indeed, there are numer­ous exam­ples of reward sys­tems that do not fos­ter com­pli­ance.

Perhaps one of the best examples was the experience of several police departments with the decriminalization of public intoxication. Both the District of Columbia and Minneapolis, among other jurisdictions, rescinded their ordinances that defined public drunkenness as a misdemeanor, setting up detoxification centers to which police were asked to bring persons who were found to be drunk on the streets. Under the old system, police patrols would arrest drunks and bring them into the local jail for an overnight stay. The arrests so made would “count” towards the department's measures of policing activity. Patrolmen were motivated thereby to pick up drunks and book them into the local jail, especially in periods when other arrest opportunities were slight. In contrast, under the new system, the handling of drunks did not count towards an officer’s arrest record. The consequence: police did not bring drunks into the new detoxification centers and the municipalities eventually had to set up separate service systems to rustle up clients for the detoxification systems.8

The illustrations given above should be sufficient to make the general point that the appropriate implementation of social programs is a problematic matter. This is especially the case for programs that rely on persons to deliver the service in question. There is no doubt that federal, state, and local agencies can calculate and deliver checks with precision and efficiency. There also can be little doubt that such agencies can maintain a physical infrastructure that delivers public services efficiently, even though there are a few examples of the failure of water and sewer systems on scales that threaten public health. But there is a lot of doubt that human [pg18] services that are tailored to differences among individual clients can be done well at all on a large scale basis.

We know that public education is not doing equally well in facilitating the learning of all children. We know that our mental health system does not often succeed in treating the chronically mentally ill in a consistent and effective fashion. This does not mean that some children cannot be educated or that the chronically mentally ill cannot be treated; it does mean that our ability to do these activities on a mass scale is somewhat in doubt.


This paper started out with a recital of the sev­eral metal­lic laws stat­ing that eval­u­a­tions of social pro­grams have rarely found them to be effec­tive in achiev­ing their desired goals. The dis­cus­sion mod­i­fied the metal­lic laws to express them as sta­tis­ti­cal ten­den­cies rather than rigid and inflex­i­ble laws to which all eval­u­a­tions must strictly adhere. In this lat­ter sense, the laws sim­ply do not hold. How­ev­er, when stripped of their rigid­i­ty, the laws can be seen to be valid as sta­tis­ti­cal gen­er­al­iza­tions, fairly accu­rately rep­re­sent­ing what have been the end results of eval­u­a­tions “on-the-av­er­age”. In short, few large-s­cale social pro­grams have been found to be even min­i­mally effec­tive. There have been even fewer pro­grams found to be spec­tac­u­larly effec­tive. There are no social sci­ence equiv­a­lents of the .9

Were this conclusion the only message of this paper, then it would tell a dismal tale indeed. But there is a more important message in the examination of the reasons why social programs fail so often. In this connection, the paper pointed out two deficiencies:

First, pol­icy rel­e­vant social sci­ence the­ory that should be the intel­lec­tual under­pin­ning of our social poli­cies and pro­grams is either defi­cient or sim­ply miss­ing. Effec­tive social poli­cies and pro­grams can­not be designed con­sis­tently until it is thor­oughly under­stood how changes in poli­cies and pro­grams can affect the social prob­lems in ques­tion. The social poli­cies and pro­grams that we have tested have been designed, at best, on the basis of com­mon sense and per­haps intel­li­gent guess­es, a weak foun­da­tion for the con­struc­tion of effec­tive poli­cies and pro­grams.

In order to make pro­gress, we need to deepen our under­stand­ing of the long range and prox­i­mate cau­sa­tion of our social prob­lems and our under­stand­ing about how active inter­ven­tions might alle­vi­ate the bur­dens of those prob­lems. This is not sim­ply a call for more funds for social sci­ence research but also a call for a redi­rec­tion of social sci­ence research toward under­stand­ing how pub­lic pol­icy can affect those prob­lems.

Sec­ond, in point­ing to the fre­quent fail­ures in the imple­men­ta­tion of [pg19] social pro­grams, espe­cially those that involve labor inten­sive deliv­ery of ser­vices, we may also note an impor­tant miss­ing pro­fes­sional activ­ity in those fields. The phys­i­cal sci­ences have their engi­neer­ing coun­ter­parts; the bio­log­i­cal sci­ences have their health care pro­fes­sion­als; but social sci­ence has nei­ther an engi­neer­ing nor a strong clin­i­cal com­po­nent. To be sure, we have clin­i­cal psy­chol­o­gy, edu­ca­tion, social work, pub­lic admin­is­tra­tion, and law as our coun­ter­parts to engi­neer­ing, but these are only weakly con­nected with basic social sci­ence. What is appar­ently needed is a new pro­fes­sion of social and orga­ni­za­tional engi­neer­ing devoted to the design of human ser­vices deliv­ery sys­tems that can deliver treat­ments with fidelity and effec­tive­ness.

In short, the dou­ble mes­sage of this paper is an argu­ment for fur­ther devel­op­ment of pol­icy rel­e­vant basic social sci­ence and the estab­lish­ment of the new pro­fes­sion of social engi­neer.


See Also

  1. eg. the , / Pour­nelle’s Iron Law of Bureau­cracy / Schwartz’s Iron law of insti­tu­tions, or the ; Aaron Shaw offers a col­lec­tion of 33 other laws, some (al­l?) of which seem to be real. –Ed­i­tor↩︎

  2. Note that the law emphasizes that it applies primarily to “large scale” social programs, primarily those that are implemented by an established governmental agency covering a region or the nation as a whole. It does not apply to small scale demonstrations or to programs run by their designers.↩︎

  3. See also . –Ed­i­tor↩︎

  4. One is reminded of the old phi­los­o­phy say­ing that . –Ed­i­tor↩︎

  5. Unfor­tu­nate­ly, it has proven diffi­cult to stop large scale pro­grams even when eval­u­a­tions prove them to be ineffec­tive. The fed­eral job train­ing pro­grams seem remark­ably resis­tant to the almost con­sis­tent ver­dicts of ineffec­tive­ness. This lim­i­ta­tion on the Edis­on­ian par­a­digm arises out of the ten­dency for large scale pro­grams to accu­mu­late staff and clients that have exten­sive stakes in the pro­gram’s con­tin­u­a­tion.↩︎

  6. This is a com­plex exam­ple in which there are many com­pet­ing expla­na­tions for the fail­ure of the pro­gram. In the first place, the pro­gram may be a good exam­ple of the fail­ure of prob­lem the­ory since the pro­gram was ulti­mately based on a the­ory of crim­i­nal behav­ior as psy­chopathol­o­gy. In the sec­ond place, the pro­gram the­ory may have been at fault for employ­ing coun­sel­ing as a treat­ment. This exam­ple illus­trates how diffi­cult it is to sep­a­rate out the three sources of pro­gram fail­ures in spe­cific instances.↩︎

  7. Rossi greatly under­sells Project Fol­low Through here: it was not merely an edu­ca­tional exper­i­ment but one of the largest ever run, and, sim­i­lar to the Office of Eco­nomic Oppor­tu­ni­ty’s “per­for­mance con­tract­ing” exper­i­ment, almost all of the inter­ven­tions failed (and were harm­ful), with the excep­tion of the peren­ni­al­ly-un­pop­u­lar inter­ven­tion.↩︎

  8. See also . –Ed­i­tor↩︎

  9. Or or, more spec­u­la­tive­ly, . –Ed­i­tor↩︎

  10. It’s unclear what book this is; World­Cat & Ama­zon & Google Books have no entry for a book named “Eval­u­a­tion of Newark and Hous­ton Polic­ing Exper­i­ments”, and Google returns only Rossi’s paper. The Police Foun­da­tion web­site lists 2 reports for 1985: “Neigh­bor­hood Police Newslet­ters: Exper­i­ments in Newark and Hous­ton” (exec­u­tive sum­mary, tech­ni­cal report, appen­dices) and “The Hous­ton Vic­tim Recon­tact Exper­i­ment” (exec­u­tive sum­mary, tech­ni­cal report, appen­dices). Pos­si­bly these were pub­lished together in a print form and this is what Rossi is ref­er­enc­ing? –Ed­i­tor↩︎

  11. This appears to be a ref­er­ence to 10 sep­a­rate pub­li­ca­tions. CLMS #1–7’s data and the #8 report are avail­able online; I have not found #9–10.↩︎

  12. It is worth con­trast­ing this strik­ing esti­mate of the effect usu­ally being zero in the IES’s RCTs as a whole with the far more san­guine esti­mates one sees derived from aca­d­e­mic pub­li­ca­tions in Lipsey & Wil­son 1993’s (and to a much lesser extent, Bond et al 2003’s “One Hun­dred Years of Social Psy­chol­ogy Quan­ti­ta­tively Described”). One man’s modus ponens…↩︎