
Let’s look at three different stories and use them to investigate statistical generalizations.

Story 1

This semester I’m teaching a Reasoning and Critical Thinking course. During the first class, I ran through various questions designed to show that human thinking is subject to predictable and systematic errors. Everything was going swimmingly. Most students committed the conjunction fallacy, ignored regression towards the mean, and failed the Wason selection task.

I then came to one of my favorite examples from Kahneman and Tversky: base rate neglect. I told the students that “Steve is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail,” and then asked how much more likely it is that Steve is a librarian than a farmer. Most students thought it was moderately more likely that Steve was a librarian.

Delighted with this result, I explained the mistake. While Steve is more representative of a librarian, you need to factor in base-rates before concluding he is more likely to actually be a librarian. In the U.S. there are about two million farmers and fewer than one hundred and fifty thousand librarians. Additionally, while 70% of farmers are male, only about 20% of librarians are. So for every one librarian named Steve you should assume there are at least forty-five farmers so named.

This culminated in my exciting reveal: even if you think that librarians are twenty times more likely than farmers to fit the personality sketch, you should still think Steve is more than twice as likely to be a farmer.
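The arithmetic behind that reveal can be checked in a few lines. The sketch below uses only the rounded figures quoted above (two million farmers, 150,000 librarians as an upper bound, 70% and 20% male) and treats the factor of twenty as a likelihood ratio. The function and variable names are illustrative, not part of any library.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Rounded figures from the text (the librarian count is an upper bound).
male_farmers = 2_000_000 * Fraction(70, 100)    # 1,400,000
male_librarians = 150_000 * Fraction(20, 100)   # 30,000

# Prior odds, farmer : librarian, among men, before hearing the sketch.
prior = male_farmers / male_librarians          # 140/3, roughly 46.7

# Assumption: the sketch fits librarians twenty times better than farmers,
# which is a likelihood ratio of 1/20 for "farmer" over "librarian".
posterior = posterior_odds(prior, Fraction(1, 20))  # 7/3, roughly 2.33

# Probability of "farmer" if farmer and librarian were the only two options.
p_farmer = posterior / (1 + posterior)          # 7/10

print(f"prior odds (farmer : librarian)     = {float(prior):.1f}")
print(f"posterior odds (farmer : librarian) = {float(posterior):.2f}")
print(f"probability Steve is a farmer       = {float(p_farmer):.2f}")
```

Even after granting librarians a twenty-to-one advantage on resemblance, the odds still favor farmer by a little more than two to one.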

This is counter-intuitive, and I expected pushback. But then a student asked a question I had not anticipated. The student didn’t challenge my claim’s statistical legitimacy; he challenged its moral legitimacy. Wasn’t this a troubling generalization from gender stereotypes? And isn’t reasoning from stereotypes wrong?

It was a good question, and in the moment I gave only a so-so reply. I acknowledged that judging based on stereotypes is wrong, and then I…

  1. distinguished stereotypes proper from empirically informed statistical generalizations (explaining the psychological literature suggesting stereotypes are not statistical generalizations, but unquantified generics that the human brain attributes to intrinsic essences);
  2. explained how the most pernicious stereotypes are statistically misleading (e.g., we accept generic generalizations at low statistical frequencies about stuff we fear), and so would likely be weakened by explicit reasoning from rigorous base-rates rather than intuitive resemblances;
  3. and pointed out that racial disparities present in statistical generalizations act as important clarion calls for political reform.

I doubt my response satisfied every student — nor should it have. What I said was too simple. Acting on dubious stereotypes is often wrong, but acting on rigorous statistical generalizations can also be unjust. Consider a story recounted in Bryan Stevenson’s Just Mercy:

Story 2

“Once I was preparing to do a hearing in a trial court in the Midwest and was sitting at counsel table in an empty courtroom before the hearing. I was wearing a dark suit, white shirt, and tie. The judge and the prosecutor entered through a door in the back of the courtroom laughing about something.

When the judge saw me sitting at the defense table, he said to me harshly, ‘Hey, you shouldn’t be in here without counsel. Go back outside and wait in the hallway until your lawyer arrives.’

I stood up and smiled broadly. I said, ‘Oh, I’m sorry, Your Honor, we haven’t met. My name is Bryan Stevenson, I am the lawyer on the case set for hearing this morning.’

The judge laughed at his mistake, and the prosecutor joined in. I forced myself to laugh because I didn’t want my young client, a white child who had been prosecuted as an adult, to be disadvantaged by a conflict I had created with the judge before the hearing.”

This judge did something wrong. Because Bryan Stevenson is black, the judge assumed he was the defendant, not the defense. Now, I expect the judge acted on an implicit racist stereotype, but suppose the judge had instead reasoned from true statistical background data. It is conceivable that more of the Black people who enter that judge’s courtroom — even those dressed in suit and tie — are defendants than defense attorneys. Would shifting from stereotypes to statistics make the judge’s behavior ok?

No. The harm done had nothing to do with the outburst’s mental origins, whether it originated in statistics or stereotypes. Stevenson explains that what is destructive is the “accumulated insults and indignations caused by racial presumptions,” the burden of “constantly being suspected, accused, watched, doubted, distrusted, presumed guilty, and even feared.” This harm is present whether the judge acted on ill-formed stereotypes or statistically accurate knowledge of base-rates.

So, my own inference about Steve is not justified merely because it was grounded in a true statistical generalization. Still, I think I was right and the judge was wrong. Here is one difference between my inference and the judge’s. I didn’t act as though I knew Steve was a farmer; I just concluded it was more likely he was. The judge didn’t act the way he would if he thought it was merely likely Stevenson was the defendant. The judge acted as though he knew Stevenson was the defendant. But the statistical generalizations we are considering cannot secure such knowledge.

The knowledge someone is a defendant justifies different behavior than the thought someone is likely a defendant. The latter might justify politely asking Stevenson if he is the defense attorney. But the latter couldn’t justify the judge’s actual behavior, behavior unjustifiable unless the judge knows Stevenson is not an attorney (and dubious even then). A curious fact about ethics is that certain actions (like asserting or punishing a criminal) require, not just high subjective credence, but knowledge. And since mere statistical information cannot secure knowledge, statistical generalizations are unsuitable justifications for some actions.

Statistical disparities can justify some differential treatment. For instance, seeing that so few of the Black people in his courtroom are attorneys could justify the judge in funding mock trial programs only at majority Black public schools. Indeed, it might even justify the judge, in these situations, only asking Black people if they are new defense attorneys (and just assuming white people are). But it cannot justify behavior, like harsh chastisement, that requires knowledge the person did something wrong.

I didn’t do anything that required knowledge that Steve was a farmer. So does this mean I’m in the clear? Maybe. But let’s consider one final story from the recent news:

Story 3

Due to COVID-19 the UK canceled A-level exams, a primary determinant of UK college admissions. (If you’re unfamiliar with the A-levels, they are sort of like really difficult subject-specific SAT exams.) The UK replaced the exams with a statistical generalization. They subjected the grades that teachers and schools submitted to a statistical normalization based on the historical performance of the student’s school. Why did Ofqual (the Office of Qualifications and Examinations Regulation) feel the need to normalize the results? Well, for one thing, the predicted grades that teachers submitted were 12% higher than last year’s scores (unsurprising without any external test to check teacher optimism).

The normalization, then, adjusted many scores downward. If Ofqual predicted, based on historical data, that at least one student in a class would have failed the exam, then the lowest-scoring student’s grade was adjusted to that failing grade (irrespective of how well the teacher predicted the student would have done).
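To make the mechanism concrete, here is a deliberately simplified toy version of the rule just described. It is not Ofqual’s actual model: the function name, the input format, the example class, and the use of "U" as the failing grade are all assumptions made for illustration. It implements only the single step above, in which the lowest-ranked students receive a failing grade whenever the school’s history predicts failures.

```python
def apply_predicted_fails(ranked_students, predicted_fails, fail_grade="U"):
    """Toy rule: overwrite the grades of the lowest-ranked students.

    ranked_students: list of (name, teacher_grade) pairs, best first
                     (a hypothetical input format).
    predicted_fails: number of failures the historical data predicts
                     for this class.
    """
    n = len(ranked_students)
    cutoff = n - min(predicted_fails, n)  # index where forced fails begin
    adjusted = []
    for position, (name, teacher_grade) in enumerate(ranked_students):
        # The teacher's prediction is ignored at or below the cutoff.
        grade = fail_grade if position >= cutoff else teacher_grade
        adjusted.append((name, grade))
    return adjusted

# Hypothetical class of four, all predicted to pass by their teacher.
students = [("Asha", "A"), ("Ben", "A"), ("Cara", "B"), ("Dev", "B")]
result = apply_predicted_fails(students, predicted_fails=1)
print(result)
```

In this toy class the last-ranked student’s grade is fixed by the school’s past results, not by anything that student did or that the teacher predicted.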

Unsurprisingly, this sparked outrage and the UK walked back the policy. Students felt the system was unfair since they had no opportunity to prove they would have bucked the trend. Additionally, since wealthier schools tended to perform better on the A-levels in previous years, the downgrading hurt students in poorer schools at a higher rate.

Now, this feels unfair. (And since justifiability to the people matters for government policy, I think the government made the right choice in walking back the policy.) But was it actually unfair? And if so, why?

It’s not an issue of stereotypes: the changes weren’t based on hasty stereotypes, but rather on a reasonable statistical generalization. It’s not an issue of compounding algorithmic bias (of the sort described in O’Neil’s book), as the algorithm didn’t produce results more unequal than actual test results. Nor was the statistical generalization used in a way that requires knowledge. College admissions don’t assume we know one student is better than another. Rather, they use lots of data to make informed guesses about which students will be the best fit. The algorithm might sometimes misclassify, but so could any standardized test.

So what feels unfair? My hunch is the algorithm left no space for the exceptional. Suppose four friends who attended a historically poor-performing school spent the last two years frantically studying together in a way no previous group had. Had they sat the test, all could have secured top grades, a first for the school. Unfortunately, they couldn’t sit the test, and because their grades are normalized against previous years the algorithm eliminates their possibility of exceptional performance. (To be fair to the UK, they said students could sit the exams in the fall if they felt they could out-perform their predicted score.)

But what is unfair about eliminating the possibility of exceptional success? My further hunch is that seeing someone as having the possibility of exceptional success is part of what it is to see them as an individual (perhaps for Kantian reasons of seeing someone as a free first cause of their own actions). Sure, we can accept that most people will be like most people. We can even be ok with wealthier schools, in the aggregate, consistently doing better on standardized tests. But we aren’t ok with removing the possibility for any individual to be an exception to the trend.

When my students resisted my claim that Steve was likely a farmer, they did not resist the generalization itself. They agreed most farmers are men and most librarians are women. But they were uncomfortable moving from that general ratio to a probabilistic judgment about the particular person, Steve. They seemed to worry that applying the generalization to Steve precluded seeing Steve as an exception.

While I think the students were wrong to think the worry applied in this case — factoring in base-rates doesn’t prevent the exceptional from proving their uniqueness — they might be right that there is a tension between seeing someone within a statistical generalization and seeing someone as an individual. It’s a possibility I should have recognized, and a further way acting on even good statistical generalizations might sometimes be wrong.

Marshall is currently completing his PhD in Philosophy at Florida State University. He primarily studies the intersection of ethics and the nature of persons. Outside of academia, Marshall also directs curricular design for high school debate camps with the Victory Briefs Institute.