Cautionary Remarks When Testing Agreement between Two Raters for Continuous Scale Measurements: A Tutorial in Clinical Epidemiology with Implementation Using R
Background: When measurements are on a continuous scale, agreement between two measuring devices is assessed both graphically and analytically. For clinical investigations, Bland and Altman proposed plotting the subject-wise differences between raters against the subject-wise averages. To assess agreement formally, Bartko recommended combining this graphical approach with the statistical procedure suggested by Bradley and Blackwood. The advantage of the combined approach is that it enables significance testing and sample size estimation. We note that using the regression output directly is misleading, and we provide a correction.
Methods: Graphical methods and linear models are used to assess agreement for continuous scale measurements. We demonstrate that the linear regression results produced by software should not be used as reported, and we provide the correct analytic procedure. In particular, the degrees of freedom of the F-statistic are incorrectly reported; we address this by giving the correct analytic form of the F-statistic. R functions for sample size estimation are also given.
Results: We believe that the tutorial and the R code are useful tools for testing and estimating agreement between two rating protocols for continuous scale measurements. Interested readers may apply the code to their own data when agreement between two raters is of interest.
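As an illustration of the two ingredients named in the abstract, the following is a minimal sketch, not the authors' R code. It assumes the usual textbook forms: Bland-Altman 95% limits of agreement computed as the mean difference plus or minus 1.96 standard deviations, and the Bradley-Blackwood statistic obtained by regressing the difference D on the sum S and testing intercept and slope jointly, with 2 and n - 2 degrees of freedom. The overall F-statistic that a standard regression summary prints tests the slope alone, with 1 and n - 2 degrees of freedom, which is one plausible reading of the degrees-of-freedom problem the abstract mentions. Python is used here purely for illustration; the function names and the data are hypothetical.

```python
import math
from statistics import mean, stdev


def bland_altman_summary(x1, x2):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x1, x2)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the subject-wise differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def bradley_blackwood(x1, x2):
    """Simultaneous F test of equal means and equal variances.

    Regresses D = x1 - x2 on S = x1 + x2 by least squares and tests
    intercept = slope = 0. Returns (F, df1, df2, p_value).
    """
    n = len(x1)
    d = [a - b for a, b in zip(x1, x2)]
    s = [a + b for a, b in zip(x1, x2)]
    s_bar, d_bar = mean(s), mean(d)

    # Ordinary least-squares fit of D on S.
    sxx = sum((si - s_bar) ** 2 for si in s)
    sxy = sum((si - s_bar) * (di - d_bar) for si, di in zip(s, d))
    slope = sxy / sxx
    intercept = d_bar - slope * s_bar
    sse = sum((di - intercept - slope * si) ** 2 for si, di in zip(s, d))

    # Numerator uses the UNCORRECTED sum of squares of D, so both the
    # intercept and the slope are tested: 2 numerator degrees of freedom.
    sum_d2 = sum(di ** 2 for di in d)
    df1, df2 = 2, n - 2
    f_stat = ((sum_d2 - sse) / df1) / (sse / df2)

    # Upper tail of F(2, df2) has the closed form (1 + 2F/df2)^(-df2/2).
    p_value = math.pow(1.0 + df1 * f_stat / df2, -df2 / 2.0)
    return f_stat, df1, df2, p_value


# Hypothetical paired readings from two raters (made-up numbers).
rater_a = [10.1, 12.3, 9.8, 14.2, 11.5, 13.0, 10.9, 12.7]
rater_b = [10.4, 12.0, 10.1, 14.6, 11.2, 13.3, 11.0, 12.9]

bias, lower, upper = bland_altman_summary(rater_a, rater_b)
f_stat, df1, df2, p_value = bradley_blackwood(rater_a, rater_b)
print(f"bias={bias:.3f}, limits of agreement=({lower:.3f}, {upper:.3f})")
print(f"F({df1}, {df2})={f_stat:.3f}, p={p_value:.4f}")
```

Regressing on the sum rather than the average leaves the F-statistic unchanged, because rescaling the regressor does not alter the fitted values. The closed-form tail probability holds only when the numerator degrees of freedom equal 2; a general F distribution function would be needed otherwise.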
References
[1]
Bland, J.M. and Altman, D.G. (1986) Statistical Methods for Assessing Agreement between Two Methods of Clinical Measurement. The Lancet, 327, 307-310. https://doi.org/10.1016/S0140-6736(86)90837-8
[2]
Bland, J.M. and Altman, D.G. (1995) Comparing Methods of Measurement: Why Plotting Difference against Standard Method Is Misleading. The Lancet, 346, 1085-1087. https://doi.org/10.1016/S0140-6736(95)91748-9
[3]
Bartko, J.J. (1994) Measures of Agreement: A Single Procedure. Statistics in Medicine, 13, 737-745. https://doi.org/10.1002/sim.4780130534
[4]
Bradley, E.L. and Blackwood, L.G. (1989) Comparing Paired Data: A Simultaneous Test for Means and Variances. The American Statistician, 43, 234-235. https://doi.org/10.1080/00031305.1989.10475665
[5]
Stephenson, J.M. and Babiker, A. (2000) Overview of Study Design in Clinical Epidemiology. Sexually Transmitted Infections, 76, 244-247. https://doi.org/10.1136/sti.76.4.244
[6]
Last, J.M. (1988) What Is “Clinical Epidemiology”? Journal of Public Health Policy, 9, 159-163. https://doi.org/10.2307/3343001
Kottner, J., et al. (2011) Guidelines for Reporting Reliability and Agreement Studies (GRRAS). Journal of Clinical Epidemiology, 64, 96-106. https://doi.org/10.1016/j.jclinepi.2010.03.002
[10]
Morgan, W.A. (1939) A Test for the Significance of the Difference between Two Variances in a Sample from a Normal Bivariate Population. Biometrika, 31, 13-19. https://doi.org/10.1093/biomet/31.1-2.13
[11]
Pitman, E.J.G. (1939) A Note on Normal Correlation. Biometrika, 31, 9-12. https://doi.org/10.1093/biomet/31.1-2.9
[12]
Gulliksen, H. and Wilks, S.S. (1950) Regression Tests for Several Samples. Psychometrika, 15, 91-114. https://doi.org/10.1007/BF02289195
[13]
Lazo, M. and Clark, J.M. (2008) The Epidemiology of Nonalcoholic Fatty Liver Disease: A Global Perspective. Seminars in Liver Disease, 28, 339-350. https://doi.org/10.1055/s-0028-1091978
[14]
Prati, D., Taioli, E., Zanella, A., Della Torre, E., Butelli, S., Del Vecchio, E. and Conte, D. (2002) Updated Definitions of Healthy Ranges for Serum Alanine Aminotransferase Levels. Annals of Internal Medicine, 137, 1-10. https://doi.org/10.7326/0003-4819-137-1-200207020-00006
[15]
Sanai, F.M., Helmy, A., Dale, C., Al-Ashgar, H., Abdo, A.A., Katada, K. and Hashem, A. (2011) Updated Thresholds for Alanine Aminotransferase Do Not Exclude Significant Histological Disease in Chronic Hepatitis C. Liver International, 31, 1039-1046. https://doi.org/10.1111/j.1478-3231.2011.02551.x
[16]
Hayes, K., O’Brian, K. and Kinsella, A. (2017) A Decomposition of the Bradley-Blackwood Paired-Samples Omnibus Test. Communications in Statistics-Theory and Methods, 46, 9892-9896. https://doi.org/10.1080/03610926.2016.1222439
[17]
Friedrich-Rust, M., Ong, M.F., Martens, S., Sarrazin, C., Bojunga, J., Zeuzem, S. and Herrmann, E. (2008) Performance of Transient Elastography for the Staging of Liver Fibrosis: A Meta-Analysis. Gastroenterology, 134, 960-974. https://doi.org/10.1053/j.gastro.2008.01.034
[18]
Friedrich-Rust, M., Rosenberg, W., Parkes, J., Herrmann, E., Zeuzem, S. and Sarrazin, C. (2010) Comparison of ELF, FibroTest and FibroScan for the Non-Invasive Assessment of Liver Fibrosis. BMC Gastroenterology, 10, Article No. 103. https://doi.org/10.1186/1471-230X-10-103
[19]
Carroll, R.J. and Ruppert, D. (1988) Transformation and Weighting in Regression. Chapman and Hall, New York. https://doi.org/10.1007/978-1-4899-2873-3
[20]
Cohen, J. (1992) A Power Primer. Psychological Bulletin, 112, 155-159. https://doi.org/10.1037/0033-2909.112.1.155
[21]
Shoukri, M.M. (2010) Measures of Interobserver Agreement and Reliability. 2nd Edition, Chapman & Hall/CRC, Boca Raton. https://doi.org/10.1201/b10433
[22]
Shoukri, M.M. (2015) Agreement. Encyclopedia of Biostatistics. Wiley, New York.