This article previously appeared in the November 1993 issue of The Teaching Professor, where it was excerpted and reprinted with permission from The Center for Teaching Effectiveness Newsletter at the University of Texas at Austin.
Because so much depends upon the evaluation of a student’s learning and the resulting grade, it is in everyone’s interest to try to make the evaluation system as free from irrelevant errors as possible. Borrowing from the evaluation literature, I propose the four R’s of evaluation—Relevant, Reliable, Recognizable, Realistic—as ways to ensure the quality of our evaluation systems.
The first R is relevance, known in the jargon as the validity of an evaluation method. This means that any activity used to evaluate a student’s learning must be an accurate reflection of the skill or concept being tested. What are the characteristics of a relevant evaluation?
Oddly enough, one characteristic that might seem very mundane is that the evaluation activity must appear related to the course content (known in the jargon as face validity). A common student complaint is that tests are not related to the course content or what was presented in class. Although we know that what we assign is directly related to the course, the students often don’t see the connection. And, student impressions aside, the more obvious the connection, the higher the probability that we really have a valid evaluation activity.
A second characteristic of relevant evaluations is that they are derived directly from the objectives (known in the jargon as content validity). The most obvious way to achieve this is to follow the objectives as closely as possible in selecting activities.
If your objective is that the students will be able to select the appropriate statistic for analyzing a given set of data, the evaluation should provide them with a data set and have them select the analysis. It could take many forms:
- an in-class exam where no actual calculations are done,
- an out-of-class exam where no actual calculations are done,
- an out-of-class homework assignment involving extensive calculations,
- a component of a large-scale semester-long project,
- an in-class exercise done in groups with class-generated data.
All of these alternatives represent relevant tests of that objective.
Another characteristic of a relevant evaluation is how well performance on that evaluation predicts performance on other closely related skills, either at the same time (concurrent validity) or in the future (predictive validity). If the skill you are supposedly testing should be highly correlated with some other skill which you are also testing, chart the students’ performances on each and see if they follow the same pattern.
To use a simplified example, we can say that the ability to add two single-digit numbers is a precursor to, and therefore highly correlated with, the ability to add two two-digit numbers. Therefore, students who do poorly on the former should not be able to do well on the latter. If they do, then one of the two tests is not measuring what it is supposed to be measuring and is therefore not relevant to the addition skill we are trying to evaluate.
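For instructors who keep scores in a spreadsheet or gradebook export, the comparison described above can be sketched in a few lines of Python. This is only an illustration: the score lists are invented, the function name `pearson` is my own, and the Pearson correlation coefficient is one common way to summarize whether two sets of scores follow the same pattern, not the only one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean, and the two sums of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / sqrt(var_x * var_y)


# Hypothetical percent scores for the same six students on two related tests
# (invented for illustration; not data from any real class).
single_digit = [95, 88, 72, 60, 91, 45]
two_digit = [90, 85, 70, 52, 94, 40]

r = pearson(single_digit, two_digit)
print(f"correlation between the two tests: {r:.2f}")
```

A value near 1 suggests the two tests rank students in much the same order. A value near 0, for skills that ought to be closely related, is a signal to look again at what one of the tests is actually measuring.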
The second R is reliability: how consistently an evaluation activity measures whatever it measures, without being affected too much by the situation. A student’s grade should not hang on a single performance or on the mood of the person making the judgment. Of course, no system is perfectly reliable; none will produce exactly the same evaluation of performance each time. The goal here is to eliminate as many sources of error as possible.
The three biggest sources of error in reliably evaluating a student are:
- poor communication of expectations,
- lack of consistent criteria for judgment, and
- lack of sufficient information about performance.
Poor communication of expectations means that poor student performance may be the result of the student’s failure to correctly interpret the task requirements. In written exams this usually is caused by ambiguous questions, unclear instructions, corrections given verbally during the test, and so on. In each case, a bad grade is the result of the student not understanding the question. The student may in fact know the material.
Lack of consistent criteria for judgment means that, if the same performance were to be judged a second time by the same grader, or if another grader evaluated it, it might not receive the same grade because the basis for judging was unclear. The clearer the criterion for judging a student’s performance, the more reliable the evaluation becomes.
For example, one real strength of multiple-choice tests is that the grading is very reliable. Either the students marked the correct answer or they didn’t; very little is left to the judgment of the grader. On the other hand, essay tests are notoriously unreliable unless the instructor takes pains to make the criteria explicit and keeps checking to make sure he or she is not straying too far from the preset criteria.
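One way to check whether essay grading is staying close to the preset criteria is to have a second grader (or the same grader, later) re-mark a sample of papers and count how often the two grades agree. The sketch below assumes letter grades and uses invented data; the function names are hypothetical, and Cohen's kappa is a standard agreement statistic that I am supplying for illustration, not a method named in the discussion above.

```python
from collections import Counter


def percent_agreement(grades_a, grades_b):
    """Fraction of papers on which the two graders gave the same grade."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty, equal-length lists of grades")
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return matches / len(grades_a)


def cohen_kappa(grades_a, grades_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(grades_a)
    observed = percent_agreement(grades_a, grades_b)
    counts_a = Counter(grades_a)
    counts_b = Counter(grades_b)
    # Chance agreement: probability both graders pick the same grade
    # if each assigned grades independently at their own observed rates.
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both graders used a single identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two graders on the same eight essays.
grader_1 = ["A", "B", "B", "C", "A", "C", "B", "D"]
grader_2 = ["A", "B", "C", "C", "B", "C", "B", "D"]

agreement = percent_agreement(grader_1, grader_2)
kappa = cohen_kappa(grader_1, grader_2)
print(f"same grade on {agreement:.0%} of essays, kappa = {kappa:.2f}")
```

Raw agreement is the easier number to explain, but it can look high simply because most papers receive the same one or two grades. Kappa discounts that chance agreement, so a low kappa alongside a respectable raw agreement suggests the criteria need to be made more explicit.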
Lack of sufficient information is the third source of error in evaluating students, not just in terms of the amount of information, but also in terms of variety of information sources. Not everyone excels in every format. Using only one format may introduce a source of bias for or against some students and lower the reliability of an evaluation.
Our third R is the need for the evaluation system to be recognizable to the students. By this we mean that students should be aware of how they will be evaluated and their class activities should prepare them for those evaluations. Testing should not be a game of “Guess what I’m going to ask you.”
Students don’t mind “hard” tests as long as there are no surprises and they can recognize the relationship of the test to the course. Some instructors may criticize this as “teaching the test,” but in reality the test should be the best statement of the course expectations and therefore should mirror the teaching. Furthermore, few courses are taught at such a low level that tests are verbatim transcripts of the class or text; rather they are interpretations or new examples of the class or text material.
All of the above activities require work, on the part of either the students or the teacher. So, to avoid burning out either, the final R is that the evaluation system should be realistic: the amount of information obtained is balanced by the amount of work required. Too often we forget that our students are taking three to four other courses along with ours.
What is realistic? Unfortunately, no one can give a blanket answer to that question. I can say that several smaller assignments tend to be more valuable than one large assignment. Alternatively, if a large assignment is called for, spreading it out across the semester and requiring components to be handed in periodically is a good technique, both pedagogically and administratively.