Interrater reliability (IRR) statistics, such as Cohen’s kappa, measure agreement between raters beyond what is expected by chance when classifying items into categories. While Cohen’s kappa has been widely used, it has several limitations, prompting the development of Gwet’s agreement statistic, an alternative “kappa” statistic that models chance agreement via an “occasional guessing” model. However, we show that Gwet’s formula for estimating the proportion of agreement due to chance is itself biased at intermediate levels of agreement, despite overcoming the limitations of Cohen’s kappa at high and low agreement levels. We derive a maximum likelihood estimator for the occasional guessing model that yields an unbiased estimator of the IRR, which we call the maximum likelihood kappa ($\kappa_{\text{ML}}$). The key result is that the chance agreement probability under the occasional guessing model is simply equal to the observed rate of disagreement between raters. The $\kappa_{\text{ML}}$ statistic provides a theoretically principled approach to quantifying IRR that addresses the limitations of previous $\kappa$ coefficients. Given the widespread use of IRR measures, an unbiased estimator is important for reliable inference across domains where rater judgments are analyzed.
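To make the stated result concrete, the following is a minimal sketch (not the authors' code) of how such an estimator could be computed for a two-rater contingency table, alongside Cohen’s kappa for comparison. The function names and the example table are illustrative assumptions; the sketch applies the standard kappa form (p_o − p_c)/(1 − p_c) with the chance term p_c set to the observed disagreement rate 1 − p_o, which simplifies to (2·p_o − 1)/p_o.

```python
# Sketch under the assumptions stated above; not the paper's reference implementation.
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a square contingency table of two raters' labels."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n                               # observed agreement
    p_e = (table.sum(axis=1) @ table.sum(axis=0)) / n**2    # chance agreement from marginals
    return (p_o - p_e) / (1.0 - p_e)

def kappa_ml(table):
    """Kappa-style estimate where chance agreement equals observed disagreement (1 - p_o)."""
    table = np.asarray(table, dtype=float)
    p_o = np.trace(table) / table.sum()
    p_c = 1.0 - p_o                      # chance term set to the observed disagreement rate
    return (p_o - p_c) / (1.0 - p_c)     # equivalently (2 * p_o - 1) / p_o

# Illustrative example: two raters, binary labels, high agreement with skewed prevalence.
table = [[80, 5],
         [5, 10]]
print(f"Cohen's kappa: {cohens_kappa(table):.3f}")   # ~0.608
print(f"kappa_ML:      {kappa_ml(table):.3f}")       # ~0.889
```

In this skewed-prevalence example the two estimates diverge, which illustrates why the choice of chance-agreement model matters.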