Orchestrating false beliefs about gender discrimination

[Edit: I wrote a more succinct and updated version of this post here. Probably read that instead.]

This article from the New York Times claims that there is extreme gender discrimination in tech:

In a 2016 experiment conducted by the tech recruiting firm Speak With a Geek, 5,000 résumés with identical information were submitted to firms. When identifying details were removed from the résumés, 54 percent of the women received interview offers; when gendered names and other biographical information were given, only 5 percent of them did.

This turned out to be fake (Twitter thread with details): the study doesn’t actually exist. It was also ridiculous on its face. If you know anything about plausible effect sizes, or interview processes, or the real world, you would immediately see that a disparity of this magnitude is wholly implausible.

What about other studies regarding gender discrimination? For example, there is that study about classical orchestras, where blind auditions massively increased women’s chances of being hired.

Here is The Guardian on this study:

Even when the screen is only used for the preliminary round, it has a powerful impact; researchers have determined that this step alone makes it 50% more likely that a woman will advance to the finals

This study has 1,388 citations. It has also been featured on Freakonomics, in TED talks, on Reddit, Slate, the New York Times, Wikipedia, and I’m sure countless other outlets. I hear about it frequently in real life when gender discrimination is discussed. It is actively mentioned on Twitter, as can be seen in this search.

I have not once heard anything skeptical said about this study, and it is published in a fine journal, so one would think it is a solid result. But let’s look into the paper itself. It is publicly available here. (Statistics follow; scroll down for the conclusion.)

Table 4 presents the first results comparing success in blind auditions vs non-blind auditions.

The value for relative female success is the proportion of women who are successful in the audition process minus the proportion of men who are successful. The values for non-blind auditions are positive, meaning a larger proportion of women are successful, whereas the values for blind auditions are negative, meaning a larger proportion of men are successful. So this table unambiguously shows that men do comparatively better in blind auditions than in non-blind auditions, the exact opposite of what is claimed.
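To make the measure concrete, here is a tiny worked example with made-up numbers (not the paper’s data):

```python
# "Relative female success" with made-up numbers (not the paper's data):
# the proportion of women advancing minus the proportion of men advancing.
women_advanced, women_total = 5, 40
men_advanced, men_total = 12, 60

relative_female_success = women_advanced / women_total - men_advanced / men_total
print(f"{relative_female_success:+.3f}")  # -0.075: a larger share of the men advanced
```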

Now, of course, this measure could be confounded. It is possible that the group of people who apply to blind auditions is not identical to the group of people who apply to non-blind auditions. Perhaps women (mistakenly, as it turns out) think that more of them succeed in blind auditions, and thus more women apply to those, lowering the average skill level among the female applicants and affecting the results.
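A toy simulation can illustrate how such self-selection could produce the pattern in Table 4 even under perfectly gender-blind judging. Everything here is invented for illustration: the pool sizes, the assumed skill shift, and the 20% advancement cutoff are not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_female_success(n_men, n_women, women_skill_shift):
    """Gender-blind judging: the top 20% by skill advance, regardless of gender."""
    men = rng.normal(0.0, 1.0, n_men)
    women = rng.normal(women_skill_shift, 1.0, n_women)
    cutoff = np.quantile(np.concatenate([men, women]), 0.8)
    return (women > cutoff).mean() - (men > cutoff).mean()

# Assumption: the female pool at non-blind auditions matches the male pool in skill,
# while weaker female applicants disproportionately select into blind auditions.
print("non-blind:", relative_female_success(2000, 2000, women_skill_shift=0.0))
print("blind:    ", relative_female_success(2000, 2000, women_skill_shift=-0.3))
# The blind auditions come out worse for women purely because of who applied,
# not because of anything in the evaluation itself.
```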

There are some data on people who have applied both to orchestras using blind auditions and to orchestras using non-blind auditions, presented in Table 5:

Here we see that the proportion of women advanced is higher in blind auditions in the hired, finals, and preliminaries categories, but lower in the semifinals category. However, the sample sizes are small and the values fluctuate across categories, so it is highly doubtful that we can conclude anything from this table.

In the next table they instead address the issue by regression analysis. Here they can include covariates such as the number of auditions attended, the year, etc., hopefully correcting for the sample-composition problems mentioned above. Such analyses should always be viewed critically, since the results can vary a lot depending on which covariates you include. The results are listed in Table 6.
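For readers unfamiliar with this kind of setup, here is a minimal sketch of what such a regression might look like, using invented data and hypothetical column names; the paper’s actual specification and variables differ.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical audition-level data (one row per candidate per audition).
# Column names and values are invented for illustration, not taken from the paper.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "blind": rng.integers(0, 2, n),
    "year": rng.integers(1970, 1996, n),
    "n_auditions": rng.integers(1, 6, n),
    "advanced": rng.binomial(1, 0.3, n),  # outcome: did the candidate advance?
})

# Linear probability model with covariates. The female:blind interaction asks whether
# the female-male gap in advancement differs between blind and non-blind auditions,
# after controlling for year and the number of auditions attended.
model = smf.ols("advanced ~ female * blind + C(year) + n_auditions", data=df).fit()
print(model.params["female:blind"], model.pvalues["female:blind"])
```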

This is a somewhat complicated regression table. Again, the values fluctuate wildly, with the proportion of women advanced in blind auditions being higher in the finals, and the proportion of men advanced being higher in the semifinals. But again, the sample sizes are too small, the p-values are not significant, and nothing can be concluded from this. There is one p-value (0.042) that is nominally significant at 0.05. However, given that at least 8 tests are performed in this table, we should adjust for multiple testing, which with the Bonferroni correction gives an adjusted p-value of about 0.33.
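The arithmetic behind that adjustment is just the nominal p-value multiplied by the number of tests:

```python
# Bonferroni correction: multiply the nominal p-value by the number of tests
# (capped at 1). With 8 tests, 0.042 is nowhere near significance.
p_nominal, n_tests = 0.042, 8
p_adjusted = min(p_nominal * n_tests, 1.0)
print(f"{p_adjusted:.3f}")  # 0.336
```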

The paper also looks at the issue from another angle: how do blind auditions affect the likelihood that a woman is ultimately hired? The results in this section (even after a statistically questionable data split) are not significant, as the authors state in the paper:

Women are about 5 percentage points more likely to be hired than are men in a completely blind audition, although the effect is not statistically significant. The effect is nil, however, when there is a semifinal round, perhaps as a result of the unusual effects of the semifinal round. The impact for all rounds [columns (5) and (6)] is about 1 percentage point, although the standard errors are large and thus the effect is not statistically significant.

So, in conclusion, this study presents no statistically significant evidence that blind auditions increase the chances of female applicants. In my reading, the unadjusted results seem to weakly indicate the opposite, that male applicants have a slightly increased chance in blind auditions; but this advantage disappears with controls.
