The highs and lows of performance evaluation: Towards a measurement theory for machine learning
February 8, 2023 at 5 pm CET
Our understanding of performance evaluation measures for machine-learned classifiers has improved considerably over the last decades. However, there is a range of areas where this understanding is still lacking, leading to ill-advised practices in classifier evaluation. This is clearly problematic: if machine learning researchers are unclear about what exactly their experiments are telling them about their machine learning algorithms, how can end-users trust systems deploying those algorithms?
I suggest that in order to make further progress we need to develop a proper measurement theory of machine learning. Measurement theory studies the concepts of measurement and scale. If one has a way to measure, say, the length of individual rods or planks, this should also allow one to calculate the combined length of concatenated rods or planks. What relevant concatenation operations are there in data science and AI, and what does that mean for the underlying measurement scale?
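As a rough illustration of the concatenation question (a sketch that is not taken from the talk itself), consider pooling two test sets. The hypothetical `Result` class and `concatenate` function below are assumptions introduced only for this example. Counts of correct predictions add under concatenation, much like rod lengths, whereas accuracy combines as a size-weighted mean, so a plain average of the two accuracies can be misleading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical summary of a classifier's outcome on one test set."""
    correct: int  # number of correctly classified instances
    total: int    # number of instances in the test set

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def concatenate(a: Result, b: Result) -> Result:
    """Pool two test sets: the counts are additive, like lengths of rods."""
    return Result(a.correct + b.correct, a.total + b.total)


a = Result(correct=90, total=100)    # small test set, accuracy 0.90
b = Result(correct=300, total=900)   # large test set, accuracy 1/3

combined = concatenate(a, b)

# Unweighted mean of the two accuracies: ignores test-set sizes.
naive_mean = (a.accuracy + b.accuracy) / 2

# Size-weighted mean: agrees with the accuracy on the pooled test set.
weighted_mean = (a.total * a.accuracy + b.total * b.accuracy) / (a.total + b.total)

print("pooled accuracy:  ", combined.accuracy)
print("weighted mean:    ", weighted_mean)
print("unweighted mean:  ", naive_mean)
```

In this toy setting the pooled accuracy matches the size-weighted mean but differs substantially from the unweighted one, which hints at why the choice of concatenation operation matters for the measurement scale.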
I discuss by example what such a measurement theory might look like and what kinds of new results it would entail. I furthermore argue that key properties such as classification ability and data set difficulty are unlikely to be directly observable, suggesting the need for latent-variable models. Ultimately, machine learning experiments need to go beyond simple correlations and aim to make causal inferences of the form ‘Algorithm A outperformed algorithm B because the classes were highly imbalanced’, or counterfactually, ‘if the classes were re-balanced, this performance difference between A and B would not have been observed’.
Peter Flach has been Professor of Artificial Intelligence at the University of Bristol since 2003. An internationally leading scholar in the evaluation and improvement of machine learning models using ROC analysis and calibration, he has also published on mining highly structured data, and has an interest in human-centred AI. He is the author of Simply Logical: Intelligent Reasoning by Example (John Wiley, 1994) and Machine Learning: the Art and Science of Algorithms that Make Sense of Data (Cambridge University Press, 2012).
From 2010 until 2020, Prof Flach was Editor-in-Chief of the Machine Learning journal, one of the two top journals in the field, which has been published for over 25 years, first by Kluwer and now by Springer. He was Programme Co-Chair of the 1999 International Conference on Inductive Logic Programming, the 2001 European Conference on Machine Learning, the 2009 ACM Conference on Knowledge Discovery and Data Mining, and the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases in Bristol. He is a past President and current Vice-President of the European Association for Data Science. He is a Fellow of the European Association for Artificial Intelligence and of the Alan Turing Institute for Data Science and Artificial Intelligence.