
Many people are familiar with algorithms and machine learning from applications like social media or advertising, but it can be harder to appreciate the full range of areas to which machine learning has been applied. In addition to regulating all sorts of financial transactions, an algorithm might be used to evaluate teaching performance, or in medicine to help identify illness or those at risk of disease. With this wide array of applications comes a wide array of ethical considerations, which grow more pressing as the real-world consequences mount. For example, machine learning has been used to train AI to detect cancer. But what happens when the algorithm is wrong? What are the ethical issues when it isn’t entirely clear how the AI makes its decisions and there is a very real possibility that it could be wrong?

Consider the use of machine learning to predict whether someone charged with a crime is likely to reoffend. Because of massive backlogs in various court systems, many jurisdictions have turned to such tools to move defendants through the courts more efficiently. Criminal risk assessment tools consider a number of details of a defendant’s profile and then produce a recidivism score. A lower score usually means a more lenient sentence, while a higher score usually produces a harsher one. The reasoning is that if you can accurately predict criminal behavior, resources can be allocated more efficiently for rehabilitation or for prison sentences. Also, the thinking goes, decisions are better made on data-driven recommendations than on the personal feelings and biases a judge may have.

But these tools have significant downsides as well. As Cathy O’Neil discusses in her book Weapons of Math Destruction, statistics show that in certain counties in the U.S. a Black person is three times more likely to receive a death sentence than a white person, and computerized risk models intended to reduce such prejudice are no less prone to bias. As she notes, “The question, however, is whether we’ve eliminated human bias or simply camouflaged it with technology.” She points out that questionnaires used in some models include questions like “the first time you were ever involved with the police,” which is likely to yield very different answers depending on whether the respondent is white or Black. As she explains, “if early ‘involvement’ with the police signals recidivism, poor people and racial minorities look far riskier.” The fact that such models are susceptible to bias also means they are not immune to error.

As mentioned, researchers have also applied machine learning in the medical field. Again, the benefits are not difficult to imagine. Cancer-detecting AI has been able to identify cancers that humans could not. Faster detection of a disease like lung cancer allows for quicker treatment and thus the ability to save more lives. Right now, about 70% of lung cancers are detected in late stages, when they are harder to treat.

AI has the potential not only to save lives but also to make more efficient use of medical resources. Unfortunately, just like the criminal justice applications, applications in the medical field are subject to error. For example, hundreds of AI tools were developed to help deal with the COVID-19 pandemic, but a study by the Turing Institute found that these tools had little impact. In a review of 232 algorithms for diagnosing patients, a recent medical journal paper found that none of them were fit for clinical use. Despite the hype, researchers are “concerned that [AI] could be harmful if built in the wrong way because they could miss diagnoses and underestimate the risk for vulnerable patients.”

There are many reasons why an algorithm designed to detect or sort things might make errors. Machine learning requires massive amounts of data, so how well an algorithm performs will depend on the quality of the data it is trained on. As O’Neil has pointed out, a problematic questionnaire can lead to biased predictions. Similarly, incomplete training data can cause a model to perform poorly in real-world settings. As Koray Karaca explains in a recent article on inductive risk in machine learning, creating a model requires precise methodological choices. But these choices are often driven by background assumptions, shaped by simplification and idealization, which create problematic uncertainties. Different assumptions can create different models and thus different possibilities of error. And there is always a gap between a finite amount of empirical evidence and an inductive generalization, meaning there is always an inherent risk in using such models.

If an algorithm determines that I have cancer and I don’t, it could dramatically affect my life in all sorts of morally salient ways. On the other hand, if I have cancer and the algorithm says I don’t, that too can have a harmful moral impact on my life. So is there a moral responsibility involved, and if so, who is responsible? In a 1953 article called “The Scientist Qua Scientist Makes Value Judgments,” Richard Rudner argues that “since no scientific hypothesis is completely verified, in accepting a hypothesis the scientist must make the decision that evidence is sufficiently strong or that the probability is sufficiently high to warrant the acceptance of the hypothesis…How sure we need to be before we accept a hypothesis will depend on how serious a mistake would be.”

These considerations regarding the possibility of error and the threshold for sufficient evidence represent calculations of inductive risk. For example, we may judge the consequences of asserting that a patient does not have cancer when they actually do to be far worse than the consequences of asserting that a patient does have cancer when they actually do not. Because of this, and given our susceptibility to error, we may accept a lower standard of evidence for concluding that a patient has cancer and a higher standard for concluding that they do not, in order to minimize the worst consequences should an error occur. But how do algorithms do this? Machine learning involves optimizing a model by testing it against sample data. Each time an error is made, a learning algorithm adjusts the model’s parameters to reduce the total error, which can be calculated in different ways.
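The asymmetric standards of evidence described above can be sketched as a decision threshold applied to a model’s output probability. This is a minimal illustration, not any real diagnostic system: the probabilities and the `classify` helper are invented for the example.

```python
def classify(prob_cancer: float, threshold: float = 0.5) -> str:
    """Flag a patient as 'positive' if the model's estimated
    probability of cancer meets the chosen evidence threshold."""
    return "positive" if prob_cancer >= threshold else "negative"

# Hypothetical probabilities a model might assign to five patients.
probs = [0.08, 0.35, 0.50, 0.72, 0.91]

# A symmetric 0.5 threshold treats both kinds of error as equally bad.
symmetric = [classify(p, 0.5) for p in probs]

# Judging a missed cancer (false negative) far worse than a false
# alarm, we lower the evidence bar required to assert "positive".
cautious = [classify(p, 0.3) for p in probs]

print(symmetric)  # the 0.35 patient is cleared
print(cautious)   # the 0.35 patient is flagged for follow-up
```

Note that nothing in the model itself dictates where the threshold sits; choosing it is precisely the kind of value judgment Rudner describes.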

Karaca notes that optimization can be carried out either in cost-sensitive or -insensitive ways. Cost-insensitive training assigns the same value to all errors, while cost-sensitive training involves assigning different weights to different errors. But the assignment of these weights is left to the modeler, meaning that the person who creates the model is responsible for making the necessary moral judgments and preference orderings of potential consequences. In addition, Karaca notes that inductive risk concerns arise for both the person making methodological choices about model construction and later for those who must decide whether to accept or reject a given model and apply it.
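Karaca’s distinction can be illustrated with a toy error tally. The outcomes and weights below are invented for illustration, and this is a simplification rather than his formalism: cost-insensitive evaluation counts every error as 1, while cost-sensitive evaluation lets the modeler weight error types differently.

```python
# Each pair is (true_label, predicted_label): 1 = has disease, 0 = does not.
# These outcomes are made up for the example.
outcomes = [(1, 0), (0, 1), (1, 1), (0, 0), (1, 0), (0, 1), (0, 1)]

def total_cost(outcomes, w_false_neg=1.0, w_false_pos=1.0):
    """Sum the cost of all errors, with separate weights per error type."""
    cost = 0.0
    for true, pred in outcomes:
        if true == 1 and pred == 0:    # false negative: a missed case
            cost += w_false_neg
        elif true == 0 and pred == 1:  # false positive: a false alarm
            cost += w_false_pos
    return cost

# Cost-insensitive: every error counts the same (5 errors -> 5.0).
print(total_cost(outcomes))

# Cost-sensitive: the modeler decides a missed diagnosis is five
# times worse than a false alarm -- a moral judgment, not a
# mathematical one (2 misses * 5 + 3 alarms * 1 -> 13.0).
print(total_cost(outcomes, w_false_neg=5.0))
```

A learning algorithm minimizing the second quantity rather than the first will prefer models that rarely miss cases, even at the price of more false alarms; the chosen weights thus encode the modeler’s preference ordering over consequences.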

What this tells us is that machine learning inherently involves moral choices, and that these choices bear on evaluations of the acceptable risk of error. How we define a “successful” model is tied up with our own attitudes toward risk. But this poses a further question: how is there accountability in such a system? Many companies hide the results of their models, or even their existence. Yet, as we have seen, moral accountability in the use of AI is of paramount importance. At each stage of assessment, we encounter an information asymmetry: the victims of such AI are left to “prove” the algorithm wrong against the very evidence that demonstrates how “successful” the model is.