%0 Journal Article
%T Inter-rater and test–retest reliability of quality assessments by novice student raters using the Jadad and Newcastle–Ottawa Scales
%A Carolina Oremus
%A ECT & Cognition Systematic Review Team
%A Geoffrey B C Hall
%A Margaret C McKinnon
%A Mark Oremus
%J BMJ Open
%D 2012
%R 10.1136/bmjopen-2012-001368
%X Introduction Quality assessment of included studies is an important component of systematic reviews. Objective The authors investigated inter-rater and test–retest reliability for quality assessments conducted by inexperienced student raters. Design Student raters received a training session on quality assessment using the Jadad Scale for randomised controlled trials and the Newcastle–Ottawa Scale (NOS) for observational studies. Raters were randomly assigned into five pairs, and each rater independently rated the quality of 13–20 articles. These articles were drawn from a pool of 78 papers examining cognitive impairment following electroconvulsive therapy to treat major depressive disorder. The articles were randomly distributed to the raters. Two months later, each rater re-assessed the quality of half of their assigned articles. Setting McMaster Integrative Neuroscience Discovery and Study Program. Participants 10 students taking McMaster Integrative Neuroscience Discovery and Study Program courses. Main outcome measures The authors measured inter-rater reliability using κ and the intraclass correlation coefficient type (2,1), or ICC(2,1). The authors measured test–retest reliability using ICC(2,1). Results Inter-rater reliability varied by scale question. For the six-item Jadad Scale, question-specific κs ranged from 0.13 (95% CI −0.11 to 0.37) to 0.56 (95% CI 0.29 to 0.83). The ranges were −0.14 (95% CI −0.28 to 0.00) to 0.39 (95% CI −0.02 to 0.81) for the NOS cohort and −0.20 (95% CI −0.49 to 0.09) to 1.00 (95% CI 1.00 to 1.00) for the NOS case–control. For overall scores on the six-item Jadad Scale, ICC(2,1)s for inter-rater and test–retest reliability (accounting for systematic differences between raters) were 0.32 (95% CI 0.08 to 0.52) and 0.55 (95% CI 0.41 to 0.67), respectively. Corresponding ICC(2,1)s for the NOS cohort were −0.19 (95% CI −0.67 to 0.35) and 0.62 (95% CI 0.25 to 0.83), and for the NOS case–control, the ICC(2,1)s were 0.46 (95% CI −0.13 to 0.92) and 0.83 (95% CI 0.48 to 0.95). Conclusions Inter-rater reliability was generally poor to fair, and test–retest reliability was fair to excellent. A pilot rating phase following rater training may be one way to improve agreement.
%U https://bmjopen.bmj.com/content/2/4/e001368
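
Note: the abstract names two reliability statistics, Cohen's κ for pairs of raters and ICC(2,1) in the Shrout and Fleiss (1979) sense (two-way random effects, absolute agreement, single rater). The following is an illustrative Python sketch of how these can be computed, not the authors' code; the ratings data are invented and the confidence intervals reported in the paper are omitted.

import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters assigning categorical scores."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    po = np.mean(r1 == r2)  # observed proportion of agreement
    # chance agreement from the raters' marginal category frequencies
    pe = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    return (po - pe) / (1 - pe)

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x is an (n subjects) x (k raters) array of scores."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # subjects
    msc = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))              # error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Invented example: two raters scoring ten articles on a 0-5 quality scale.
rater1 = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]
rater2 = [3, 3, 2, 4, 3, 4, 2, 2, 5, 3]
print(f"kappa    = {cohens_kappa(rater1, rater2):.2f}")
print(f"ICC(2,1) = {icc_2_1(np.column_stack([rater1, rater2])):.2f}")

ICC(2,1) is the form consistent with the abstract's note that systematic differences between raters are accounted for, since the two-way model includes a rater (column) effect.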