With the academic semester now coming to a close, students all over North America will be filling out student evaluation of teaching forms, or SETs. Typically, students answer a range of questions regarding their experience with a class, including how difficult the class was, whether work was returned in a timely manner and graded fairly, and how effective their professors were overall. Some students will even go the extra step and visit RateMyProfessors.com, the oldest and arguably most popular website for student evaluations of professors. There students can also give their professors an overall rating, rate how difficult the course was, indicate whether they would take a class with the professor again, and, alas, whether one really needed to show up to class in order to pass.
While many students likely do not give their SETs much of a second thought once they’ve completed them (if they complete them at all), many universities take student ratings seriously, and often refer to them in determining raises, promotion, and even tenure decisions for early-career faculty. Recently, however, there have been growing concerns over the use of SETs in making such decisions. These concerns broadly fall into two categories: first, whether such evaluations are in fact a good indicator of an instructor’s quality of teaching, and second, that student evaluations correlate highly with a number of factors that are irrelevant to an instructor’s quality of teaching.
The first concern is one of validity, namely whether SETs are actually indicative of a professor’s teaching effectiveness. There is reason to think that they are not. For instance, as reported at University Affairs, a recent pair of studies by Berkeley professor Philip Stark and Richard Freishtat indicate that student evaluations correlate most highly with actual or expected performance in a class. In other words, a student who gets an A is much more likely to give their professor a higher rating, and a student who gets a D is much more likely to give them a lower rating. Evaluations also correlated highly with student interest (disinterested students gave lower ratings overall, interested students gave higher ratings) and perceived easiness (easier courses receive higher ratings than difficult ones). These findings cast serious doubt on whether students are evaluating how effective their instructors were, rather than simply how well they did in the class or how much they liked it.
These and other studies have recently led Ryerson University in Toronto, Canada, to officially stop using SETs as a metric to determine promotion and tenure, a decision reached at the behest of professors who had long argued that SETs are unreliable. Perhaps even more troubling than student evaluations correlating with class performance and interest, though, was that SETs showed bias against professors on the basis of “gender, ethnicity, accent, age, even ‘attractiveness’…making SETs deeply discriminatory against numerous ‘vulnerable’ faculty.” If SETs are indeed biased in this way, that would constitute a good reason to stop using them.
Perhaps the most egregious form of explicit bias in student ratings could be found, until recently, on the RateMyProfessors website, which allowed students to rate professor “hotness,” a score indicated on a “chili pepper” scale. That such a scale existed removed a significant amount of credibility from the website; the fact that there are no controls over who can make ratings on the site is also a major reason why few take it seriously. The removal of the chili pepper rating came only after complaints from numerous professors who argued that it encouraged the objectification of female professors in particular, and contributed to a climate in which it is somehow seen as appropriate to evaluate professors on the basis of their looks.
While it seems unlikely that any official SET administered by a university includes a chili pepper scale, the studies above report that bias related to perceived attractiveness is present regardless. For instance, the above-mentioned reports found that when it comes to overall evaluations of professors, “attractiveness matters,” stating that “more attractive instructors received better ratings,” and that when it came to female professors specifically, students were more likely to comment directly on their physical appearance. The study provided one example from an anonymous student evaluation that stated: “The only strength she [the professor] has is she’s attractive, and the only reason why my review was 4/7 instead of 3/7 is because I like the subject.” As the report goes on to state, “Neither of these sentiments has anything at all to do with the teacher’s effectiveness or the course quality, and instead reflect gender bias and sexism.”
It gets worse: in addition to evaluations correlating with perceived attractiveness, characteristics like gender, ethnicity, race, and age all affect evaluations of professors as well. As Freishtat reports, “When students think an instructor is female, students rate the instructor lower on every aspect of teaching;” white professors are rated generally higher than professors of other races and ethnicities, and age “has been found to negatively impact teaching evaluations.”
If the only problem with SETs were that they are unreliable, universities would have a practical reason to stop using them: if the goal of SETs is to help identify which professors are deserving of promotion and tenure, and they are unable to contribute to this goal, it seems that they should be abandoned. But as we’ve seen, there is a potentially much more pernicious side to SETs, namely that they systematically display student bias in numerous ways. It seems, then, that universities would also have a moral obligation to revise the way that they evaluate professors.
Given that universities need to evaluate the teaching ability of professors, what is a better way of doing so? Freishtat suggests that SETs, if they are to be used at all, should be only one component of a professor’s evaluation instead of its entirety: SETs would be more useful, he suggests, if they were accompanied by statements of the instructor’s approach to teaching, descriptions of courses, letters from department heads, and reviews from peers. As part of a more complete dossier, SETs could be given the context required to put them to better use. Regardless of what might constitute a better way of evaluating instructor performance, it seems clear that a system that produces unreliable and biased results ought to be reformed or abandoned.