LiberalArtsOnline Volume 6, Number 1
January 2006
I recently reviewed course evaluation forms from my students, sorting through numerical ratings and trying to decipher pages of written comments. After reading this month's essay, I looked at the information with a different eye. Institutions spend a lot of time collecting, compiling, and interpreting student course evaluations. Faculty members use information from the evaluations to improve their teaching and their courses. And institutions often use student course evaluations in the faculty evaluation process. This month, directors of institutional research Carol Trosset (Hampshire College) and Scott Baumler (Grinnell College) discuss their research on what student course evaluations really tell us about teaching effectiveness. Their findings have important implications for how we interpret and use these data.
--Kathleen S. Wise, Editor
--------------------------------------
What We Really Learn from Student Course Evaluations
by Carol Trosset
Director of Institutional Research
Hampshire College
and Scott Baumler
Director of Institutional Research
Grinnell College
Course evaluations are ubiquitous in higher education. Although the methods range from paper-and-pencil surveys to online questionnaires, the basic process is the same: institutions ask students to rate their courses at the end of every term. Questions on these forms tend to ask about the quality of teaching and how much students learned. Does this mean that student course evaluations provide meaningful measures of teaching effectiveness? We used data from course evaluations at Grinnell College to investigate this question and discovered that what we really learn from course evaluations has more to do with what students want to tell us than with what we ask them. Furthermore, our research suggests that student ratings are better suited to evaluating some types of courses and professors than others. These findings have significant implications for the faculty evaluation process, and they underscore the importance of coupling student ratings of instruction with other methods of faculty review.
Research Context and Overview of Results
Student ratings of instruction have been widely studied. The articles in Theall, Abrami, and Mets (2001) provide a good overview of studies that support the validity of student ratings as a measure of teaching effectiveness, and of some that do not. Many of these studies investigate the validity of ratings by using course section as the unit of analysis and then testing the correlation of average scores with other indicators of teaching effectiveness (such as test performance, peer reviews, or alumni ratings).
Our research was conducted in 1999 at Grinnell College, a small private liberal arts college in Iowa. It differs from previous studies in that it investigates the relationships between Likert scale responses and written text. We combined statistical analysis of students’ numeric ratings with qualitative analysis of students’ comments. Rather than looking at averages, we focused on patterns of scores and on the content of comments, relating them to each other in a cluster analysis.
The Grinnell course evaluation form (newly developed by a faculty committee) had six questions, each of which used a six-point Likert scale and invited comments. The items were as follows:
1. The course sessions were conducted in a manner that helped me to understand the subject matter of the course.
2. The instructor helped me to understand the subject matter of the course.
3. Work completed with and/or discussions with other students in this course helped me to understand the subject matter of the course.
4. The oral and written work, tests, and/or other assignments helped me to understand the subject matter of the course.
5. Required readings or other course materials helped me to understand the subject matter of the course.
6. I learned a lot in this course.
Student ratings from 610 distinct course sections were collected. The yield rate for the questionnaires (number of completed forms ÷ total enrollments) was 89%. Across all students, all courses, and all questions, 82% of the responses were moderately agree or strongly agree. The frequency of positive responses seems to indicate that, in the aggregate, students generally think highly of instruction at Grinnell. These findings are corroborated by senior survey results (HERI’s College Student Survey and HEDS’ Senior Survey) in which, every year, 85% to 98% of graduating students indicate they were satisfied or very satisfied with the instruction they received. In other words, we have additional evidence that supports the course ratings: Grinnell students are generally very satisfied with classroom instruction.
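For readers who want to see exactly how those aggregate figures are derived, here is a minimal sketch in Python. The tabular layout and column names are assumptions made for illustration; they are not Grinnell's actual data format.

import pandas as pd

# A minimal sketch of the aggregate figures reported above. The DataFrame
# layout (one row per completed form, one column per Likert item) is an
# assumption for illustration, not the actual Grinnell data format.
def summarize(forms: pd.DataFrame, total_enrollments: int) -> dict:
    """Return the questionnaire yield rate and the share of positive (5-6) ratings."""
    ratings = forms.filter(like="q")  # item columns q1..q6, scored 1-6
    yield_rate = len(forms) / total_enrollments
    positive_share = (ratings >= 5).sum().sum() / ratings.count().sum()
    return {"yield_rate": yield_rate, "positive_share": positive_share}

# Toy example: four completed forms out of five enrollments.
forms = pd.DataFrame({
    "q1": [6, 5, 4, 6],
    "q2": [5, 6, 3, 5],
    "q6": [6, 6, 5, 6],
})
print(summarize(forms, total_enrollments=5))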
Can we, however, move directly from this information to conclude that student ratings are a good way to evaluate the quality of instruction in particular courses? Has the questionnaire been operationalized such that we can make legitimate inferences about the effectiveness of individual teachers? These questions raise the issue of "construct validity"—whether the responses accurately characterize the specific concepts the questionnaire was designed to measure, or whether the survey is unintentionally measuring something else.
Statistically, we found no significant biases in the quantitative data based on gender, ethnicity, class year of students, or rank or department of instructors. Consistent with other research, there was some tendency for smaller classes and upper-level classes to receive higher marks, and expected grades had a modest correlation with student ratings. The effect sizes (statistical measures of how much these factors affected the overall results) were quite small. However, this lack of bias still does not ensure that the responses gathered on the surveys really address the specific questions that were asked.
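As an illustration of the kinds of checks described in this paragraph, the sketch below computes a grade-rating correlation and an eta-squared effect size for one categorical factor. The column names, the factor chosen, and the toy data are assumptions made for the example; they do not reproduce the authors' actual analysis.

import pandas as pd
from scipy import stats

# Illustrative bias checks, assuming one row per response with columns
# "rating" (1-6), "expected_grade" (4.0 scale), and "class_level".
def grade_rating_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between expected grade and Likert rating."""
    r, _p = stats.pearsonr(df["expected_grade"], df["rating"])
    return r

def eta_squared(df: pd.DataFrame, factor: str = "class_level") -> float:
    """Effect size: share of rating variance explained by a categorical factor."""
    grand_mean = df["rating"].mean()
    ss_total = ((df["rating"] - grand_mean) ** 2).sum()
    ss_between = sum(
        len(g) * (g["rating"].mean() - grand_mean) ** 2
        for _, g in df.groupby(factor)
    )
    return ss_between / ss_total

df = pd.DataFrame({
    "rating":         [6, 5, 4, 6, 3, 5, 6, 4],
    "expected_grade": [4.0, 3.7, 3.0, 4.0, 2.7, 3.3, 3.7, 3.0],
    "class_level":    [100, 100, 100, 200, 200, 300, 300, 300],
})
print(grade_rating_correlation(df), eta_squared(df))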
People often assume that if a survey question is asked clearly, those filling out the survey will answer the exact question that was posed, but this is not always the case. In this study, we used students’ comments as indicators of what they really had in mind when answering the questions. Obviously, not all thoughts are described in comments, but we do know that anything students commented on is something they thought about. By collecting many comments written by many students in a variety of courses, and grouping them by topic, we can develop a list of the factors that students (as a population, not necessarily as individuals) think about when evaluating courses and professors.
Content Analysis of Students’ Comments
We drew a sample of 40% of the 610 course sections to analyze student comments. The sample included courses from each department at each level. Nine topics appeared across students’ comments:
1. Professor availability
2. Niceness and approachability
3. Energy/enthusiasm
4. (Apparent) professor knowledge level
5. How well the class sessions were run
6. Whether the student liked the chosen classroom format
7. Whether the professor helped the student understand the course materials better
8. Whether the course made the student think
9. Whether the student’s skills increased
These nine items, especially the first seven, were mentioned (positively or negatively) many times on the forms. Given that we surveyed about a thousand students three or four times each, and given the breadth of the sample of courses analyzed, this list can be thought of as a composite student model of what constitutes good teaching. Not every student will think about all of these factors, but a typical group of students would probably consider this range of things when they complete course evaluations, and they are likely to consider them regardless of exactly what questions are asked on the forms.
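In the study itself the comments were read and grouped by people, not software, but a simplified tally can make the procedure concrete. In the sketch below, the topic names come from the list above, while the keyword lists, the helper function, and the sample comments are invented solely for illustration.

from collections import Counter

# Simplified sketch of tallying coded comments by topic. In the study the
# coding was done by human readers; the keyword lists below are invented
# purely to illustrate how grouped comments yield a composite topic list.
TOPIC_KEYWORDS = {
    "availability":         ["available", "office hours", "reached"],
    "niceness":             ["nice", "approachable", "friendly"],
    "enthusiasm":           ["energy", "enthusiastic", "passionate"],
    "knowledge":            ["knowledgeable", "expert", "knows"],
    "class sessions":       ["well run", "organized", "class time"],
    "format preference":    ["lecture", "discussion", "group work"],
    "helped understanding": ["helped me understand", "explained", "clear"],
    "made me think":        ["made me think", "challenged", "provoked"],
    "skills increased":     ["improved my", "skills", "learned how to"],
}

def code_comments(comments: list[str]) -> Counter:
    """Count how often each topic is touched on across a set of comments."""
    tally: Counter = Counter()
    for comment in comments:
        lowered = comment.lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            if any(k in lowered for k in keywords):
                tally[topic] += 1
    return tally

print(code_comments([
    "She was always available and explained things clearly.",
    "The discussions really made me think.",
]))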
The above list of students’ preferred topics raises interpretive challenges for the quantitative data from the evaluation forms, given that the qualitative responses did not always correspond to the specific questions asked. In particular, both extremely high and extremely low scores were often characterized by comments relating to instructor personality and/or student expectations, rather than the specifics of what the students learned. For example, here are two comments given for the second question on the evaluation ("The instructor helped me to understand the subject matter of the course") that accompanied ratings of 6 (strongly agree):
These are both very positive comments, but the first reflects on how much the student learned, while the second simply notes that the professor was readily available to students—which is certainly expected of faculty members at liberal arts colleges, but is not necessarily equivalent to effective teaching.
We found similar discrepancies in comments accompanying lower scores. Below are two, again for the second question, where students assigned the very rare scores of 2 or 1 (moderately or strongly disagree):
The first comment makes a specific point that relates to the clarity of expectations and instructions in the courses. The second, on the other hand, may not be a legitimate criticism if the instructor’s goals were centered on detailed analysis and in-depth understanding, while the student expected a broader exploration of materials.
Different Types of Courses
A cluster analysis (using Ward’s method) enabled us to identify groups of courses with similar patterns of numeric responses to all six questions. Four clusters were readily identified using only the numeric responses. The "reality" of these clusters was verified when it turned out that courses in each cluster had a distinctive pattern of student comments (i.e., the qualitative analysis corroborated the quantitative analysis). This suggests that there are four "types" of courses as perceived by students. About half of all faculty members consistently appeared in a single cluster; the other half had their courses scattered across clusters, usually two.
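For readers who want a concrete picture of the clustering step, here is a brief sketch using SciPy's implementation of Ward's method. The data are invented (each row stands for one course section's mean responses to the six items), and the choice of four clusters simply mirrors the result described above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Fake data: 40 course sections x 6 questions, ratings on a 1-6 scale.
rng = np.random.default_rng(0)
sections = np.clip(rng.normal(loc=5.0, scale=0.8, size=(40, 6)), 1, 6)

# Ward's method merges the pair of clusters that least increases
# within-cluster variance at each step.
Z = linkage(sections, method="ward")

# Cut the tree into four clusters, matching the four course "types" found.
labels = fcluster(Z, t=4, criterion="maxclust")

for k in range(1, labels.max() + 1):
    members = sections[labels == k]
    print(f"cluster {k}: {len(members)} sections, "
          f"mean item score {members.mean():.2f}")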
Seventy-five percent of the courses fell into the two larger clusters, which might be thought of as corresponding to stronger and weaker courses. They both contain quite representative distributions of classes (across levels and subject areas). Both the consistency of ratings among the students in these classes, and the tendency for their comments to focus on issues directly related to classroom management and student learning, suggest that student ratings are a fairly good way to evaluate the courses in these two clusters.
1. The Typical Good Class cluster included 42% of the courses and received the second highest mean overall score (5.6). Students considered these classes well run, saw the professor as helpful, and described the other students and the course materials as good.
2. The Ambivalent Student cluster included 33% of the courses and had a mean score of 5.1. Here, almost every student expressed ambivalence about the course and/or the instructor. Some readings were good and others were not; the professor had this good quality and that bad one; some activities worked and others didn’t.
The remaining 25% of the courses fell into the other two clusters, where we encountered features that give us less confidence in the results.
3. The Charismatic Professor cluster, including 12% of the courses, received the highest mean rating for the course overall (5.9). Comments consistently indicated that the professor was seen to have special personal attributes (being especially nice, available, and energetic). Classes were described as well run and professors as knowledgeable. Only in this cluster were professors given credit for assigning good reading materials (in other clusters students said things like, "I learned more from the readings than from the professor"). Comments in this cluster were also disproportionately likely to say that the students bonded together and discussed the material outside of class, and to see the course as meaningful and relevant to the students’ lives. Courses in the Charismatic Professor cluster came disproportionately from the upper levels (200s and 300s), and professors and students were disproportionately female. Nearly all these courses came from the humanities and social studies divisions, and they often involved ethnic or gender studies. The students’ focus on their own emotions and on instructors’ personalities suggests that these courses are being evaluated in ways that are less intellectually reflective than are the courses in the first two clusters.
4. The Controversial Class cluster included 13% of the courses. It received the lowest overall mean score (4.5); however, this low mean often resulted from bimodal distributions of responses. Some students thought these courses and professors were excellent (their comments often sound like those in the Charismatic Professor cluster), while others thought they were terrible. Members of the latter group often disliked how the class sessions were run and were likely to criticize other students. The mix of these two response types pulls down the average score. These courses tended to come from the lower levels (mainly 100s, some 200s). Anecdotal evidence suggests this bimodal response pattern may occur when student expectations about how a particular subject will be approached are violated (for example, when quantitative analysis was required in a history course). The wide disagreement among students in these courses raises the tricky issue of which students’ views should be considered the most consequential. Averaging the scores would almost certainly obscure whatever "really happened" in these courses, as the brief numerical sketch below illustrates.
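A tiny numerical illustration of that last point, with invented ratings: a sharply split class and a uniformly lukewarm one can produce exactly the same average, so the mean alone cannot distinguish a Controversial Class from an Ambivalent one.

from statistics import mean, stdev

# Invented ratings illustrating how an average can hide a bimodal split.
controversial = [6, 6, 6, 6, 1, 1, 2, 2]   # students sharply divided
ambivalent    = [4, 4, 4, 4, 3, 4, 3, 4]   # everyone mildly positive

for name, scores in [("Controversial", controversial), ("Ambivalent", ambivalent)]:
    print(f"{name:>13}: mean = {mean(scores):.2f}, sd = {stdev(scores):.2f}")

# Both classes average 3.75, but the standard deviations (roughly 2.4 vs. 0.5)
# show that only the first distribution is sharply split.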
The existence of the Charismatic Professor and Controversial Class clusters emphasizes the need to be cautious about assuming that high and low scores necessarily correspond to good and bad teaching. At least three factors appear to make courses especially subject to these problematic judgments: strong personal rapport and charisma, student expectations of course content that do not match the instructor’s, and the teaching of ideologically charged topics (such as gender or race).
Implications and Recommendations
When using student ratings, it is important to remember their qualitative limitations. In particular, students may evaluate faculty members using extraneous factors or criteria other than the ones deemed important by the institution. The occasions on which they do so (such as the www.ratemyprofessor.com question about perceived sexual attractiveness) contribute significantly to giving student ratings a bad reputation. Unrelated judgments may appear regardless of the specific questions asked.
The multidimensionality of student ratings suggests that they primarily address some general domain relating to fulfillment or gratification. Student satisfaction, however, is an incomplete indicator that may not reveal whether learning goals are actually achieved. Satisfaction is an important dimension to consider (a satisfied student may be more engaged and motivated; satisfaction may come through the achievement of learning goals), but conceptually, it could be irrelevant to a particular dimension of interest. (A soldier in basic training might not be very content with the countless hours spent practicing tourniquets on the legs of fellow trainees, but that skill might save a life.) Furthermore, satisfaction may only manifest itself well after the course has been completed. After all, students who are new to a subject area cannot yet be considered good judges of what they ought to be learning. The dual nature of students as both users of learning services and products of an institution pushes colleges to seek indicators more sophisticated than flat measures of satisfaction. This is, in part, why student learning assessment has come to the fore—we need to know more about the success of our teaching efforts than simply whether students are satisfied.
Faculty members generally agree that student feedback helps them improve their teaching. On the other hand, our findings strongly suggest some limitations in using student ratings as a direct measure of the quality of teaching or as a proxy for learning outcomes. When courses fall into the Charismatic Professor and Controversial Class clusters, it is especially important that professors be evaluated using multiple methods, which may include reviews of syllabi, department chair evaluations, classroom visits by colleagues, teaching portfolios, alumni surveys, and/or exam and project reviews.
Note:
A more detailed version of this research, including more statistical analysis, was presented in 1999 to the Association for Institutional Research in the Upper Midwest, and in 2000 to the Higher Education Data Sharing Consortium.
Reference:
Theall, M., P. Abrami, and L. Mets, eds. 2001. "The Student Ratings Debate: Are They Valid? How Can We Best Use Them?" New Directions for Institutional Research, no. 109. San Francisco: Jossey-Bass.
--------------------------------------
Direct responses to lao@wabash.edu. We will forward comments to the author.
--------------------------------------
The comments published in LiberalArtsOnline reflect the opinions of the author(s) and not necessarily those of the Center of Inquiry or Wabash College. Comments may be quoted or republished in full, with attribution to the author(s), LiberalArtsOnline, and the Center of Inquiry in the Liberal Arts at Wabash College.