Assessing the Evaluation of Image and Language Generation Methods

In a lecture delivered in 1883, William Thomson, 1st Baron Kelvin, noted that “when you can measure what you are speaking about, and express it in numbers, you know something about it” (Thomson, 1889). However, research has raised questions about the reliability of commonly used metrics for comparing different models, and about the ability of established methods to differentiate between machine-generated and human-generated outputs as models scale up. This contribution explored the suitability of quantitative measures for evaluating generative methods and the role of human evaluation in assessing the models and outputs of machine learning approaches.

Limitations of automated evaluation were illustrated through a study of Fréchet Inception Distance (FID) (Heusel et al., 2017). Quantitative measures are designed to capture formal properties of outputs, but they are inadequate for assessing the core capabilities targeted by current generative AI systems, such as cross-modal alignment and fine-grained semantic relationships. Empirical analysis has also shown that evaluations must account for inconsistent implementations of metrics and for the relationship between training data selection and the scores of model-based measures. Human assessment remains indispensable when systems are designed for live deployment. Here the challenges are cost, consistency, and sampling error. The presentation concluded by demonstrating how the quality of outputs from current models is testing the limits of both forms of evaluation.
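
For readers unfamiliar with how FID is computed, the following is a minimal sketch of the Fréchet distance between two Gaussians fitted to feature vectors, which is the quantity FID reports. It is not the implementation studied in the presentation. The function name frechet_distance and the random arrays standing in for features are assumptions made for illustration; a real FID pipeline would first extract features from images with a pretrained Inception network, a step omitted here.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input is a 2-D array with one row per sample and at least two
    feature columns. Hypothetical helper, for illustration only.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices. Tiny negative values from rounding are clipped.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder "features": random vectors, not Inception activations.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
shifted = reference + 3.0  # same spread, different mean

print(frechet_distance(reference, reference))
print(frechet_distance(reference, shifted))
```

The score depends entirely on the feature statistics passed in, so choices made before this formula is applied, such as the feature extractor, the image preprocessing, and the number of samples, all influence the reported value.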

W. Thomson. Popular Lectures and Addresses. 1889.

M. Heusel et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. 2017.