Using Student Evaluations to Improve Teaching

Student evaluations can be used to improve teaching, and here’s an excellent resource to inform those efforts. Author Guy Boysen writes, “The purpose of this teacher-ready review is to provide a comprehensive, empirically-based guide for the use of student evaluations to improve teaching” (p. 273). His premise is that if teachers are going to base improvement decisions on evaluation data, then they need to be using “scientifically justifiable practices” (p. 27). Specifically, they need to a) use reliable and valid forms, b) have an adequate sample of students, c) analyze the responses systematically, and d) make the results part of an ongoing professional development effort.

There’s a vast collection of studies on student ratings, covering virtually every aspect of instructional evaluation. As has been observed more than once, if you want to believe something in particular about ratings, chances are good you can find a study to support that view. “In order to avoid the potential bias of selecting single studies to fit a predetermined conclusion, this review emphasizes trends identified through meta-analysis” (p. 274). The focus is mainly on summative, end-of-course ratings. However, Boysen is not writing about how these are or should be used by administrators for promotion, tenure, and merit. He’s writing to teachers, offering advice on using rating data for improvement purposes.

Use valid and reliable instruments

If you don’t use valid and reliable instruments, you’re making decisions about instructional changes based on data that may be bogus. Boysen points out that “Many colleges . . . create their own student evaluation measures by haphazardly selecting survey questions with face validity” (p. 275). This matters because research shows that when teachers make improvements based on valid and reliable data, subsequent evaluations show larger gains than those of teachers who aren’t using valid and reliable instruments.

Boysen also tackles the continuing belief among some that student evaluations are not a valid measure of teaching effectiveness. He cites seven meta-analyses that support the validity of student ratings, which adds up to a lot of data to argue against. He also addresses the more recent belief that ratings have been rendered irrelevant by students' sense of entitlement and the consumerism of higher education. Are students evaluating the quality of the instruction, or are they punishing teachers for failing to satisfy their demands as consumers? Boysen calls this belief interesting but points out that so far it hasn't been empirically validated. In fact, there is evidence that challenges it: a huge study involving over 750,000 classes at nearly 350 colleges and universities documents that ratings in the 2002–2011 decade were “consistently” higher than they were between 1998 and 2001 (p. 275).

Have an adequate sample size

The big worry here is the low response rates generated by online data collection and the view in some quarters that disgruntled students who give low ratings and write negative comments are overrepresented in online data. Response rates for online course evaluations are lower; there's no arguing that point. Boysen writes that it's “safe to assume that at least 20 percent fewer students will complete an online versus a face-to-face student evaluation survey” (p. 276). However, he references five studies documenting that online and face-to-face evaluations produce “results of similar magnitude and correlational structure . . .” (p. 276). Furthermore, as for who completes online evaluations, research suggests it's students with higher GPAs (Boysen references five studies here as well). “Online evaluations do not appear to be dominated by students who earn low grades and who, on average, tend to give lower evaluations of their teachers” (p. 276). Finally, analyses of online and face-to-face comments show no differences in the number of students who comment or in the number of negative or positive comments they provide. In fact, several studies (five citations) show that students actually write more (by some estimates, 150 percent more) when they complete online evaluations.

What's the response rate that teachers should be looking for? It depends on class size and on what margin of error is considered acceptable. For example, with a stringent 3 percent margin of error and a class size of 20, you'd need a 97 percent response rate. With 50 students in the course, you'd need a 93 percent response rate, and with 100 students, an 83 percent response rate. If a 10 percent margin of error is acceptable, the required response rates for these class sizes drop to 58 percent, 35 percent, and 21 percent, respectively. There is no consensus, however, on what an acceptable margin of error for student ratings might be.
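
To see how figures like these arise, here is a minimal sketch (not taken from Boysen's article) that estimates required response rates from the standard sample-size formula for a proportion with a finite population correction. The confidence level is an assumption (95 percent, with the worst-case proportion of 0.5), so the output will not reproduce the percentages above, which evidently rest on looser assumptions; the sketch only illustrates how class size and margin of error interact.

```python
# Hypothetical illustration, not the calculation Boysen reports: estimate the
# response rate needed for a given class size and acceptable margin of error,
# assuming a 95 percent confidence level and worst-case proportion p = 0.5.
import math

def required_response_rate(class_size: int, margin_of_error: float,
                           z: float = 1.96, p: float = 0.5) -> float:
    """Fraction of the class that must respond to achieve the margin of error."""
    # Sample size needed if the pool of students were unlimited
    n_infinite = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Finite population correction, since a class is a small, fixed population
    n_needed = n_infinite / (1 + (n_infinite - 1) / class_size)
    return min(1.0, math.ceil(n_needed) / class_size)

for size in (20, 50, 100):
    print(f"{size} students: "
          f"{required_response_rate(size, 0.03):.0%} at a 3% margin, "
          f"{required_response_rate(size, 0.10):.0%} at a 10% margin")
```

Whatever the exact assumptions, the pattern is the same one described above: smaller classes and tighter margins of error push the required response rate toward 100 percent.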

Analyze the results and interpret the data

“Student evaluation results represent scientific data, but research suggests that faculty readily interpret that data without reference to established statistical principles” (p. 278). As an example, Boysen points to the small variations in average scores that lead faculty to conclude they've improved or that they need to improve. Error is inherent in any psychological measurement, and these less-than-precise measures of teaching effectiveness are no exception. Teachers need to look at results from multiple sections and across multiple semesters or terms before making big changes.
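
To make the point concrete, here is a small, hypothetical illustration (the ratings are invented, not data from Boysen): it places an approximate 95 percent confidence interval around one section's mean rating. If last term's average sits inside that interval, the apparent change is indistinguishable from ordinary measurement error.

```python
# Hypothetical ratings on a 5-point scale; not data from Boysen's article.
import math
import statistics

this_term = [4, 5, 3, 4, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 4]  # one section's ratings
last_term_mean = 4.2                                        # prior section's average

mean = statistics.mean(this_term)
sem = statistics.stdev(this_term) / math.sqrt(len(this_term))  # standard error of the mean
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem         # approximate 95% CI

print(f"This term: {mean:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
if ci_low <= last_term_mean <= ci_high:
    print("The change from last term is within measurement error.")
else:
    print("The change may reflect a real difference worth investigating.")
```

With only 15 respondents, the interval is wide enough that a shift of a tenth or two of a point says very little, which is exactly why looking across multiple sections and terms matters.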

Then there's the matter of student comments, which usually arrive as an unorganized collection that invites teachers to read and respond to comments one at a time, often over-reacting to negative ones. The advice is to sort through the comments systematically, setting aside those that offer no specific guidance (“great teacher, you rock”) and those with negative assessments offered by only one or two students.

Act on the results

“Student evaluations can improve teaching when they are used as part of a process of professional consultation and goal setting” (p. 279). In other words, the research suggests that results should be discussed with a peer or instructional expert. Based on that conversation, faculty should set goals and proceed to implement changes.

This review is an outstanding piece of scholarship. Any faculty member who looks at student rating results and bases improvement decisions on them would be well advised to read and regularly review this article.

Reference: Boysen, G. A. (2016). Using student evaluations to improve teaching: Evidence-based recommendations. Scholarship of Teaching and Learning in Psychology, 2(4), 273–284.
