AI Comedy of Errors, Or: The Importance of Choosing Your AI Performance Metric With Care
1. Accuracy is one of the main imperatives of Trustworthy AI.
AI systems are simply optimization machines. Yes, there is a long-standing battle over what an AI system is or does, but this will not concern us here. We’ll take a very high-level view and assume that an AI system is a system that automates a process that would normally involve human intelligence[1]. And the major requirement is that this system should do so with as few errors as possible.
In practice, once the humans in charge (someone from management, the head machine learning engineer, the lead data scientist, the client, …) have defined what an error is, the AI system is designed to optimize for the solution with the least error (or, equivalently, for the solution with the most correct outputs). Achieving the best possible optimization (i.e. the fewest errors, or the greatest number of correct responses) is a hallmark of AI quality.
In common parlance, the word accuracy best captures this characterization as minimization of error, this concept of being correct and true. According to the EU High Level Expert Group on AI (HLEG)[2], “Accuracy pertains to an AI system’s ability to make correct judgements, for example to correctly classify information into the proper categories, or its ability to make correct predictions, recommendations, or decisions based on data or models.” Of the seven requirements for trustworthy AI that are listed in the HLEG guide[3], the second requirement, technical robustness and safety, includes “…accuracy, reliability and reproducibility.” The draft EU AI regulation[4], which takes inspiration from the EU HLEG, dedicates a whole article (currently, Article 15) to the accuracy, robustness and cybersecurity requirements for a high risk AI system. Thus, at least from an EU perspective, accuracy has become enshrined as one of the main imperatives of trustworthy AI.
2. Accuracy doesn’t always mean what you think.
However, in the world of AI, accuracy has a much narrower meaning. Accuracy is only used to measure the performance of classifiers (or equivalently, classification systems). These are systems that automatically categorize input according to a finite list of possible categories. One example of a classifier would be a system deployed by a bank to predict whether loan applicants are at low, medium, or high risk of defaulting on their loan, given information about the applicants’ current account balance, home ownership status, employment situation, and requested loan amount. Another example would be a system that classifies whether social media content is hate speech or not.
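To make the idea of a classifier concrete, here is a minimal, purely illustrative sketch of what the loan-risk example might look like as code; the rules, thresholds and feature names are invented for this example and are not taken from any real system:

```python
# Purely illustrative: a toy rule-based loan-risk classifier.
# The features, thresholds and categories below are invented for this example.
def classify_loan_risk(balance, owns_home, employed, requested_amount):
    """Map an applicant's details to one of a finite list of categories."""
    if employed and owns_home and requested_amount < 2 * balance:
        return "low"
    if employed and requested_amount < 5 * balance:
        return "medium"
    return "high"

print(classify_loan_risk(balance=10_000, owns_home=True,
                         employed=True, requested_amount=25_000))  # -> "medium"
```

In practice such rules are usually learned from data rather than written by hand, but the essential point is the same: the output is always one of a finite list of categories.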
Why is the restriction of the definition of accuracy to classifiers important? To put this discussion in context: machine learning (which is itself a subset of AI) is often broken down into supervised, unsupervised, and reinforcement learning. And classifiers belong to the group of methods called supervised learning. For machine learning engineers, accuracy has no relevance in the context of unsupervised or reinforcement learning. [This leads some machine learning experts to worry that only classification systems are covered by the AI Act accuracy requirements, inadvertently leaving other AI systems, such as those based on unsupervised or reinforcement learning, out of the regulation[5]. We believe this is a misunderstanding that is best clarified at the level of international standards, which can provide appropriate translations from common-use language to technology-specific terminology.]
And the machine learning definition of accuracy? To understand this, it is necessary to realize that the performance of a classifier is measured based on how well it predicts on a test set. The test set consists of items for which the true class assignment is known; the accuracy is then the number of items that the classifier correctly classified, divided by the total number of items in the test set.
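In code, the machine learning notion of accuracy is just this ratio. A minimal sketch (the labels below are made up purely for illustration):

```python
# Minimal sketch: accuracy = correctly classified items / total items in the test set.
# The labels below are made up purely for illustration.
true_labels      = ["hate", "not_hate", "not_hate", "hate", "not_hate"]
predicted_labels = ["hate", "not_hate", "hate",     "hate", "not_hate"]

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(f"Accuracy: {accuracy:.0%}")  # 4 out of 5 correct -> 80%
```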
3. Some performance metrics may not be appropriate in your use case.
If we focus only on classification systems - is accuracy then a good indicator of the system’s performance “in the wild”? Much, of course, depends on the quality and relevance of the test set that was used to measure the system’s accuracy. But that’s another can of worms which we won’t open now. So let’s assume the test set is “good”, and try this definition of accuracy out on an example.
Suppose you are Head of Responsible Online Discourse at a social media platform. In this function, you are presented with the latest hate speech detector, an AI system that, when given a comment or phrase as input, will determine if that comment is hate speech or not. It has been developed using state-of-the-art Natural Language Processing techniques, has been designed to be language neutral, and has an incredible 99% accuracy (if you read the fine print though, due to limited availability of suitable test sets, it has been tested on only 4 languages other than English, and the accuracy is not quite as stellar for those). Your social media platform handles a message volume of approximately 1 million comments per day (a very modest estimate, although actual data on daily social media content is difficult to find). You are in charge of deciding whether, and how, to deploy this model to flag hate speech. Fortunately, those comments are in English, so you don’t need to worry about the fine print.
You are trying to determine whether you should allow the system to automatically censor comments that are classified as hate speech, or whether there should be some kind of human-in-the-loop monitoring.
This has triggered four follow-up questions:
(1) Suppose the system has classified a comment as hate speech. What is the chance that the comment really is hate speech?
(2) How many comments can I expect to find every day that have been falsely labelled as hate speech by the classifier?
(3) Suppose the system has determined that a comment is not hate speech. What is the chance that the system is wrong, and the comment actually is hate speech?
(4) How many hate speech comments can I expect the system to miss every day?
Do you know the answers? Write them down before proceeding!
You consult your data scientist, and she tells you that you are missing essential information in order to answer your questions. One major piece of the puzzle she is fortunately able to supply from internal research (generally, such information is difficult to find in published research[6]): the prevalence of hate speech on your platform. The prevalence of hate speech indicates what proportion of the 1 million comments per day can be expected to be hate speech. She informs you that the prevalence is 1% on your platform[7].
With this information, you can now determine that, on average, you can expect 0.01 * 1000000 = 10000 hate speech comments on your platform per day. This means that the remaining 990000 comments are not hate speech. However, this doesn’t seem to have brought you any closer to answering your four questions above.
4. Choose your performance metrics carefully.
Your data scientist has some bad news: accuracy is not the right performance measure for answering your questions, especially since the prevalence is so small. What you need are the True Positive Rate and the False Positive Rate. (By the way, in the semantic salad that, by now, you should realize is typical of the field of AI, the True Positive Rate is also known under different names: True Positive Rate = Sensitivity = Recall).
The True Positive Rate and the False Positive Rate are metrics that are calculated based on the performance of the classifier on the test set (just the same as accuracy).
The True Positive Rate looks at the proportion of correctly identified hate speech comments, out of the total number of hate speech comments in the test set.
The False Positive Rate looks at the proportion of comments that the classifier incorrectly labelled as hate speech, out of the total number of non-hate speech comments in the test set.
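Continuing the toy sketch from above (the labels are again invented for illustration), these two rates can be computed from a test set as follows:

```python
# Minimal sketch: True Positive Rate and False Positive Rate on a test set.
# "Positive" here means the hate speech class; the labels are invented for illustration.
true_labels      = ["hate", "not_hate", "not_hate", "hate", "not_hate", "hate"]
predicted_labels = ["hate", "hate",     "not_hate", "hate", "not_hate", "not_hate"]

true_positives  = sum(t == "hate" and p == "hate"
                      for t, p in zip(true_labels, predicted_labels))
false_positives = sum(t == "not_hate" and p == "hate"
                      for t, p in zip(true_labels, predicted_labels))

tpr = true_positives / true_labels.count("hate")       # a.k.a. sensitivity, recall
fpr = false_positives / true_labels.count("not_hate")
print(f"True Positive Rate: {tpr:.0%}, False Positive Rate: {fpr:.0%}")  # 67%, 33%
```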
You contact the developer of the classification system, and they are able to provide you with the performance measures you need: The True Positive Rate is 98%, and the False Positive Rate is 1% (no, this is not a typo, the True Positive Rate and the False Positive Rate do not have to add up to 100%).
So, now comes some math:
Since the True Positive Rate is 98%, and you recall from above that your platform expects about 10000 hate speech comments per day, the classifier can be expected to correctly identify about 9800 comments as hate speech. This leaves approximately 200 hate speech comments that are missed by the system every day. This answers question (4) above.
Since the False Positive Rate is 1%, and the platform expects about 990000 non-hate speech comments per day, there will be on average 9900 comments that are falsely labelled as hate speech every day. This answers question (2) above.
What about questions (1) and (3)? More math … or, if the math makes your head ache, just read the parts in bold.
The number of correctly identified non-hate speech comments, based on the calculations above, is 990000 - 9900 = 980100. The number of hate speech comments that were incorrectly labelled as non-hate speech, as computed above, is 200. So the total number of comments that are labelled as non-hate speech is 980100 + 200 = 980300. This means that a total of 980300 comments are labelled as non-hate speech, and of those, only 200 are incorrectly labelled and are actually hate speech - i.e. the chance that a comment labelled as non-hate speech by your system actually is hate speech is almost 0%. The answer to question (3).
The number of correctly identified hate speech comments, as calculated above, is 9800. The number of comments incorrectly labelled as hate speech, as computed above, is 9900. So the total number of comments that are labelled as hate speech is 9800 + 9900 = 19700. This means that a total of 19700 comments are labelled as hate speech, and of those, just 9800 are indeed hate speech - i.e. the chance that a comment that has been labelled as hate speech by your system actually is hate speech is just under 50%. The answer to question (1). You are stunned that the AI is performing so poorly!
Well, but what if the prevalence is actually lower than 1%, and closer to 0.001%, as suggested by the social media transparency reports[8]? With prevalence at 0.001%, the chance that a comment that has been labelled as hate speech by your system actually is hate speech is 0.1%! That's right, with prevalence down at 0.001%, pretty much every item that is labelled as hate speech by your classifier is actually not hate speech. But don't take my word for it: here is an excel sheet; you can play around with the numbers yourself.
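If you prefer code to spreadsheets, the whole back-of-the-envelope calculation also fits in a few lines of Python. The figures below are the illustrative numbers used in this post; swap in your own volume, prevalence, True Positive Rate and False Positive Rate:

```python
# Sketch of the calculation from this post; all figures are illustrative.
daily_comments = 1_000_000
prevalence = 0.01            # try 0.00001 (i.e. 0.001%) to see the effect
true_positive_rate = 0.98
false_positive_rate = 0.01

hate = daily_comments * prevalence
not_hate = daily_comments - hate

true_positives  = true_positive_rate * hate        # hate speech correctly flagged
false_negatives = hate - true_positives            # hate speech the system misses
false_positives = false_positive_rate * not_hate   # harmless comments flagged as hate speech
true_negatives  = not_hate - false_positives

flagged = true_positives + false_positives
not_flagged = false_negatives + true_negatives

print(f"(1) Chance a flagged comment really is hate speech:  {true_positives / flagged:.1%}")
print(f"(2) Comments falsely flagged as hate speech per day: {false_positives:,.0f}")
print(f"(3) Chance a non-flagged comment is hate speech:     {false_negatives / not_flagged:.2%}")
print(f"(4) Hate speech comments missed per day:             {false_negatives:,.0f}")
```

With the prevalence set to 0.00001 (i.e. 0.001%), the chance that a flagged comment really is hate speech drops to roughly 0.1%, exactly the effect described above.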
There is, actually, a test set performance metric that appears to compute exactly the answer to your question (1): the Positive Predictive Value. Similarly, the answer to your question (3) is captured by the metric called the False Omission Rate. However, a word of caution: as the calculations above have shown, the Positive Predictive Value and the False Omission Rate depend heavily on the prevalence of hate speech. It is therefore best to calculate the responses to questions (1) and (3) yourself, rather than rely on the values for Positive Predictive Value and False Omission Rate obtained from the AI system provider. Why? Because you cannot be sure that the prevalence of hate speech comments in the AI system provider’s test set is the same as the actual prevalence of hate speech comments on your platform. Fortunately, as shown above, using the True Positive Rate, the False Positive Rate, and the prevalence, you can easily make the necessary calculations yourself!
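If you like formulas, the standard relationships behind these calculations can be written down explicitly in terms of the prevalence $p$, the True Positive Rate ($\mathrm{TPR}$) and the False Positive Rate ($\mathrm{FPR}$):

$$\text{Positive Predictive Value} = \frac{\mathrm{TPR}\cdot p}{\mathrm{TPR}\cdot p + \mathrm{FPR}\cdot (1-p)}, \qquad \text{False Omission Rate} = \frac{(1-\mathrm{TPR})\cdot p}{(1-\mathrm{TPR})\cdot p + (1-\mathrm{FPR})\cdot (1-p)}$$

Plugging in p = 1%, TPR = 98% and FPR = 1% reproduces the numbers above: a Positive Predictive Value of just under 50%, and a False Omission Rate of about 0.02%.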
5. Don’t solve the puzzle alone. Collaboration helps.
You now have some tricky decisions to make. If you fully automate the labelling and removal of hate speech comments, about 50% of the comments taken down in this manner will have been removed incorrectly!
If, instead, you institute a full human review of all comments flagged as hate speech before they are removed, this will involve human review of about 19700 comments per day. Perhaps your platform has the financial resources to hire a sufficiently large cohort of human reviewers. However, the heavy psychological toll of reviewing, day after day, several hundreds, if not thousands, of hateful comments should not be underestimated.
One other complicating factor needs to be mentioned. When AI systems are used in decisions that impact people, the errors that they make tend to impact certain groups of people more than others. In fact, a growing body of research shows that offensive language detection systems have higher error rates for already disadvantaged demographics (e.g. women, the LGBTQI community, ethnic or religious minorities)[9] [10]. In the case of hate speech detection: by definition, a hateful comment targets a disadvantaged community, and therefore an error in catching such a comment is disadvantageous to them. But at the same time, a false positive – a comment that is falsely labelled as offensive – is also more likely to come from a disadvantaged demographic, or to be on a topic of interest to them. As a result, wrongful censoring by the AI system would be more likely to harm exactly the communities the system is intended to protect.
Can these issues be solved? Certainly, approaches exist that can take advantage of the efficiency of automation, while minimizing the risk of algorithmic discrimination. Understanding the correct performance measures to consider in order to detect the limits and possible negative impacts of a purely algorithmic solution was a first step in the right direction. Just as figuring out this first step required that you ask the right questions, and obtain input from your AI system provider and from your data experts (machine learning engineer, data scientist, statistician, numbers wizard, …), working on an appropriate solution will require collaboration and input from various stakeholders. And that’s a blog post for another time.
References
[1] https://plato.stanford.edu/entries/artificial-intelligence/
[2] https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai, see p. 17
[3] https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai, Chapter II.1
[4] https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence
[5] https://www.politico.eu/newsletter/ai-decoded/parliament-kickstarts-ai-debate-epp-makes-its-move-technical-errors-exclude-most-machine-learning-techniques-2/.
[6] Vidgen, B., Margetts, H., Harris, A., ‘How much online abuse is there? A systematic review of evidence for the UK’, The Alan Turing Institute, November 2019
[7] For some real world context: According to analyses of social media platforms' transparency reports by Vidgen et al, the prevalence can be as low as 0.001%. However, Amnesty Italy research (https://www.amnesty.it/campagne/contrasto-allhate-speech-online/) suggests that, at least in the context of online political discourse, prevalence of hate speech lies around 1%.
[8] See footnote 7.
[9] Sap, M., Card, D., Gabriel, S., Choi, Y., Smith, N., (2019), 'The Risk of Racial Bias in Hate Speech Detection', Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1668 – 1678.
[10] Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L., (2018), 'Measuring and Mitigating Unintended Bias in Text Classification', Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (AIES'18), pp. 67 – 73