“Even measures with perfect validity can be rendered useless if they are interpreted incorrectly, and anecdotal evidence suggests that teaching evaluations are frequently the subject of unwarranted interpretations based on assumed levels of precision that they do not possess.” (p. 641) And now there's some research verifying that faculty and administrators do make unwarranted interpretations. “We investigated if differences in teaching evaluations that are small enough to be within the standard error of measurement would still have significant effects on judgments made about teachers.” (p. 641)
There's no question that teaching evaluations matter. Teachers and administrators, including department chairs, deans, and provosts, take them seriously and make decisions based on the results. The problem is that the quantitative nature of the ratings makes the data appear precise and objective. A score of 4.62 is higher than a score of 4.34, but that does not automatically mean the faculty member with the higher score is a better or more effective teacher than the one with the lower score. “As is true of all measurements, the means produced by teaching evaluations are only an estimate of the true score; sources of error—such as small sample sizes, outliers, misinterpretation of questions, and less-than-perfect reliability—interfere with the measure of true scores.” (p. 643) So what looks precise and reliable provides no more than a “veneer of objectivity.” (p. 643)
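To see just how little a 4.62 versus a 4.34 can mean, here is a minimal sketch (my illustration, not the article's) that puts approximate 95% confidence intervals around the two means. The class size (25 respondents) and the rating standard deviation (0.8 on the five-point scale) are assumed values chosen for illustration only.

```python
# Minimal sketch: 95% confidence intervals around two course means.
# Class size and standard deviation are hypothetical, not from the article.
import math

def ci95(mean, sd, n):
    """Approximate 95% confidence interval for a section mean."""
    sem = sd / math.sqrt(n)  # standard error of the mean
    return (mean - 1.96 * sem, mean + 1.96 * sem)

# Assumed inputs: 25 respondents per section, SD of 0.8 on a 5-point scale
for label, mean in [("Teacher A", 4.62), ("Teacher B", 4.34)]:
    lo, hi = ci95(mean, sd=0.8, n=25)
    print(f"{label}: mean {mean:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Under these assumptions the intervals come out to roughly (4.31, 4.93) and (4.03, 4.65). They overlap substantially, which is exactly why the 0.28-point gap says little about who is the better teacher.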
The extensive research on student ratings conducted during the 1980s and 1990s clearly delineated appropriate statistical standards for interpreting evaluations. Unfortunately, what has been established empirically is not always implemented in practice, as this research demonstrates.
In three separate but related studies, randomly selected faculty and administrators at randomly selected postsecondary institutions and in randomly selected disciplines read short scenarios that were identical except for the teacher ratings, which were slightly higher or lower. For example, in the first study, 57 faculty members considered hiring two potential candidates and allocating a travel award. In the second study, 80 department chairs assessed two untenured faculty members who had implemented an instructional innovation. In the case of the innovations, the overall course rating was 4.20 in one scenario and 3.92 in the other.
The results of both studies were the same. The rating differences had significant effects on the judgments made about these fictional teachers. Writing about Study 1, the researchers note that “even without critical statistical information needed for interpretation, small increases in an overall teaching evaluation led participants to perceive teachers as significantly more deserving of a merit-based reward; this is a meaningful effect for a teaching evaluation of less than a third of a point on a five-point scale.” (p. 647) In Study 2, “small changes in raw means led to statistically significant differences in judgments about teaching techniques.” (p. 649)
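A quick simulation (assumptions mine, not the study's) makes the same point about the 4.20 versus 3.92 scenario: how often does pure sampling noise produce a gap of 0.28 points or more between two sections taught equally well? The rating distribution, class size, and trial count below are all invented for illustration.

```python
# Simulation sketch: two sections draw ratings from the SAME distribution,
# so any gap between their means is sampling noise. All values hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_trials = 25, 10_000
# Assumed 5-point rating distribution (mean ~4.06, SD ~0.85), identical
# for both sections
ratings = rng.choice(
    [1, 2, 3, 4, 5],
    p=[0.01, 0.04, 0.15, 0.48, 0.32],
    size=(n_trials, 2, n_students),
)
means = ratings.mean(axis=2)              # per-trial mean for each section
gaps = np.abs(means[:, 0] - means[:, 1])  # per-trial gap between sections
print(f"P(gap >= 0.28) = {np.mean(gaps >= 0.28):.2f}")
```

Under these assumptions the gap reaches 0.28 or more in roughly a quarter of the trials, even though the underlying teaching is identical.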
In Study 3, the research team decided to include both the means and the standard deviations needed to interpret them. In this case, 48 faculty were asked to imagine they were on a committee in charge of evaluating faculty for reappointment. The scenarios described two faculty members, again with virtually identical teaching assignments, course designs, and instructional methods. A table listed their student evaluation means and standard deviations. “The results of Study 3 provide the most convincing support yet that small differences in fictional teaching evaluations have a significant impact on judgments. Study 3 allowed participants to consider full and realistic information about teaching evaluations. Still, faculty in the study judged variations as small as 0.15 of a point as meaningful. The statistical tests of significance indicate that participants were providing reliably different ratings based on teaching evaluations that were . . . highly unlikely to be reliably different.” (p. 653)
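Readers who want to check the arithmetic on a 0.15-point difference can run a Welch t-test on summary statistics. The means, standard deviations, and section sizes below are hypothetical stand-ins, not the values from the article's table.

```python
# Welch t-test on summary statistics for two sections.
# All numbers are hypothetical stand-ins, not the article's table values.
from scipy import stats

mean_a, sd_a, n_a = 4.35, 0.80, 30
mean_b, sd_b, n_b = 4.20, 0.85, 30   # a 0.15-point lower mean

t, p = stats.ttest_ind_from_stats(mean_a, sd_a, n_a,
                                   mean_b, sd_b, n_b,
                                   equal_var=False)
print(f"t = {t:.2f}, p = {p:.2f}")
```

With these numbers, t ≈ 0.70 and p ≈ 0.48, nowhere near evidence of a real difference; yet the faculty in Study 3 treated gaps of this size as meaningful.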
“Despite the importance of teaching evaluations and the simplicity of the principles for their interpretation, the current studies illustrate the relative ease with which faculty members and department heads can be led to make inappropriate generalizations from limited data.” (p. 653) These are not encouraging findings, although for most of us they do not come as a surprise. Let them be a wake-up call, whether you're a teacher looking at your own results, assessing those of a peer, or discussing your results with the department chair. If you happen to be serving on a search committee or doing tenure reviews with colleagues, this would be a great article to review and discuss before looking at teaching evaluation results.
Reference: Boysen, G. A., Kelly, T. J., Raesly, H. N., & Casner, R. W. (2014). The (mis)interpretation of teaching evaluations by college faculty and administrators. Assessment & Evaluation in Higher Education, 39(6), 641-656.