MARCH 1999

End of Course Evaluations Study: Phase I

The goal of this document is to report on the preliminary reliability and validity analyses of student perceptions of teaching gathered from our current quantitative end-of-course evaluations. The introduction contains conceptual information about reliability and validity. This is followed by a section that details the procedures and statistical methods followed to analyze end-of-course evaluations. The final sections contain results and discussion of the analyses.


A mean value across students' responses within a course on any one item may reflect a number of things: (1) the actual quality the item was designed to assess, such as clarity of presentation of course material, (2) biases for or against large courses or lower versus higher level courses, as two examples, and (3) random error. Random error refers to chance fluctuations in scores that have little meaning. Such fluctuations might be due to the time of day evaluations are gathered or the speed with which students complete the evaluations, as just two of many possible examples.

Random error decreases the reliability of measurement. The consequence of such random error is that means are expected to vary by some amount that reflects this randomness rather than the actual quality of teaching. Practically, random error is dealt with by estimating the amount means are expected to vary and then making inferences based not on observed mean values, but on ranges around these means. When computed appropriately, these ranges can tell us the proportion of times a value representing the true quality of teaching (along with any biases which might be present) should be contained within the range. Standard statistical practice dictates that an acceptable range should "capture" the true value of the variable assessed 95 percent of the time. Furthermore, ranges around means should be computed in this way from two or more instructors assessed by the same item. These ranges should then be compared, and a lack of overlap in these ranges can be interpreted as reflecting a reliable, statistically significant difference between instructors. Overlap between the ranges indicates that the mean values, though they may appear different from one another, are not reliably or statistically significantly different from one another. In other words, overlap in ranges indicates that differences between means are likely due to random variation and not to differences in teaching quality.
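The range-and-overlap logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings are hypothetical, the function names are invented for the example, and a normal critical value (about 1.96) stands in for whatever critical value the actual analyses used. With small classes a t critical value would give somewhat wider ranges.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_range(ratings, confidence=0.95):
    """Return (low, high) bounds around the mean of one course's ratings.

    Assumption: a normal critical value is used; small classes would
    call for a t critical value and a somewhat wider range.
    """
    n = len(ratings)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(ratings) / sqrt(n)  # stdev needs at least 2 ratings
    m = mean(ratings)
    return (m - half_width, m + half_width)

def ranges_overlap(a, b):
    """True if two (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical ratings on a 5-point scale for three courses.
course_a = [5, 5, 4, 5, 5, 4, 5, 5]
course_b = [2, 3, 2, 3, 2, 2, 3, 2]
course_c = [4, 5, 3, 5, 4, 4, 5, 3]

range_a, range_b, range_c = (mean_range(c) for c in (course_a, course_b, course_c))
print(ranges_overlap(range_a, range_b))  # False: the means differ reliably
print(ranges_overlap(range_a, range_c))  # True: the difference may be random error
```

Courses A and C have different observed means (4.75 and 4.125), yet their ranges overlap, so under this rule the difference would not be treated as reliable.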

Biases threaten the validity of measurement. Taking a range into consideration rather than a mean value controls for random error, but not for systematic error or bias. Thus, a range may reflect a stable quality, but the meaning of this quality can never be exactly known. It could reflect teaching effectiveness. However, it could also reflect biases against lower level or lecture courses, against courses in one division of the college, in favor of one sex of instructor, and so on. Unfortunately, validity is not dealt with as easily as random error. Standard measurement practice dictates two general approaches to assuring high validity. First, multiple types of measures should be taken. Each type of measure will harbor different sets of biases; the hope is that by aggregating across many sources of information these biases will be reduced in proportion to the actual quality that one desires to assess. Second, within any one type of measurement, additional information can be collected on the variables that are suspected to introduce biases. Statistical tests can then be conducted to determine if these variables, such as level of course, are in fact significantly related to estimates of the quality of teaching. If so, then this relatedness or confounding could be corrected statistically in much the same way that demographers might relate years of education to annual income, while controlling for parents' education.

In summary, the primary goals of this study were (1) to attach ranges to students' mean estimates of teaching quality in order to see if our evaluation procedure can detect reliable differences between faculty or courses on the same qualities of teaching, and (2) to relate suspected biasing variables to mean estimates of quality to see if these variables do serve as threats to the validity of our current assessment of teaching quality. A subsidiary goal was to explore the relationship between variables that we might expect to be meaningfully related to teaching quality and students' perceptions of teaching quality. For example, we expect that experience in teaching improves the quality of teaching. If so, then finding that years of employment by the college are related to end-of-course evaluation ratings should increase our confidence in our assessment.

Procedures and Methods

Approximately 50 faculty members consented to provide their end-of-course evaluation data from the Spring 1998 semester for the purposes of this pilot study. These faculty consented to provide data from 97 courses. Not all of these faculty and courses could be represented in these analyses, however, because every department relies on a different end-of-course evaluation instrument. In order to protect the anonymity of faculty, to allow for validity comparisons across potentially biasing variables, and to increase confidence in the results of these analyses, only those departments' course evaluations that contained items assessing teaching qualities that were also assessed by other departments' forms were used. This resulted in a final sample that included information from 77 courses taught by 45 faculty who have been teaching at Grinnell from 1 to 32 years. These 77 courses included those taught across the divisions of the college (Humanities n = 30, Science n = 16, Social Studies n = 31) and concentrated at the lower levels (100 level n = 29, 200 level n = 27, 300 level n = 17, and 400 level n = 4). These courses were taught by an almost equal number of men and women (men n = 40, women n = 37) who relied on various pedagogical formats (lecture n = 10, discussion n = 8, experiential n = 13, and mixed format n = 41).

Because evaluation forms differ across departments, it was necessary to first select items that were designed to assess the same teaching quality and that recur using similar wording on a number of departments' instruments. A small sample of teaching quality items were selected from the entire pool of items. The items included in these analyses reflected the clarity of course goals, the clarity of presentation of the material, the degree to which the course encouraged critical thinking, the timeliness of graded feedback, the overall effectiveness of the instructor, and the overall value of the course. These items were chosen because they reflect a variety of teaching qualities, from very specific and concrete qualities (e.g., timeliness of feedback) to more general attributes (e.g., instructor effectiveness). The lists below indicate the variations in wording used by different departments in assessing these six qualities of teaching.

1. Course objectives

The overall organization and goals of the course were clearly explained on the syllabus.
The overall organization and goals of the course were clearly explained (in the syllabus and/or by the instructor's comments).
The goals of this course were clearly presented to me.
The overall organization and goals of the course were clear to me.
The goals of this course were presented clearly.
The objectives of the course were clear to me.

2. Clarity of presentation

My instructor covered new and/or difficult material in a clear and intelligible way.
The instructor explained the material clearly.
My instructor covered course material in a clear and intelligible way.
The professor explained course material effectively.
My instructor covered new material in a clear and intelligible manner.
The instructor presented concepts clearly.
My instructor presented new and/or difficult material clearly.
My instructor presented the course material clearly.

3. Course encouraged thought

The course encouraged me to think clearly and critically about the subject.
The course encouraged me to think clearly and critically about the subject matter.
This course encouraged me to think critically about the subject matter.
This course encouraged me to think critically about the subject.

4. Timeliness of feedback

My instructor returned work in a timely manner.
My instructor returned work in a timely fashion.
Exams/assignments were returned promptly.
My work in this course was evaluated in a timely and helpful fashion.
My work was evaluated in a timely fashion.

5. Instructor effectiveness

Overall, this instructor's teaching was effective.
How would you rate this instructor's teaching effectiveness?
How would you rate your instructor's teaching in this course?
How would you rate this instructor's performance overall?
The instructor was an effective teacher.
How would you rate this instructor's teaching?
How would you rate this instructor's teaching overall?

6. Overall course

How would you rate this course overall?
How would you rate the overall quality of this course?
My overall rating of this course is positive.

Departments also currently differ in the types of response scales they favor. Five-, six-, and seven-point scales are all employed, with some response scales offering "don't know" or "not applicable" options. Standard scores were computed within each type of response scale in order to obtain statistical analyses comparable across the different response scales. In effect, this standardization sets the mean value at 0 for one particular item (and then again for each of the teaching quality items analyzed) across all those courses that rely on one type of scale (e.g., 5-point). This is repeated for courses that rely on another type of scale (e.g., 6-point), and so on. Courses that received higher or lower than average ratings from students on a particular item are indicated by standard scores that are above or below the average of 0. There is no way to assess the biasing nature of including a 5-point versus a 7-point scale. This is because average values would naturally be higher when using a 7-point scale.
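A minimal sketch of this within-scale standardization follows. The course labels and ratings are hypothetical, the function name is invented for the example, and the population standard deviation is an assumption; the choice between the population and sample forms is not specified here.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_within_scale(courses):
    """Map course id -> standard score, computed separately per scale type.

    `courses` is a list of (course_id, scale_points, mean_rating) tuples.
    Assumption: the population standard deviation is the divisor, and each
    scale group contains courses whose ratings are not all identical.
    """
    by_scale = defaultdict(list)
    for course_id, scale_points, rating in courses:
        by_scale[scale_points].append((course_id, rating))

    z_scores = {}
    for group in by_scale.values():
        ratings = [r for _, r in group]
        center, spread = mean(ratings), pstdev(ratings)
        for course_id, rating in group:
            z_scores[course_id] = (rating - center) / spread
    return z_scores

# Hypothetical mean ratings for one item on 5-point and 7-point forms.
courses = [
    ("A", 5, 4.0), ("B", 5, 4.5), ("C", 5, 3.5),
    ("D", 7, 6.0), ("E", 7, 5.0), ("F", 7, 5.5),
]
z = standardize_within_scale(courses)
```

In this example course B (4.5 on a 5-point form) and course D (6.0 on a 7-point form) receive the same standard score, because each sits the same distance above the average for its own scale type.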

Attention to the details of the response scales and their anchor items also resulted in a number of items that were deselected from this sample. Most response scales were balanced, such that the mid-point of the scale indicated neutrality, points below the mid-point indicated negative evaluations, and points above the mid-point indicated positive evaluations. However, there were a number of items where the response scales were not balanced. For example, students may have been requested to evaluate overall teaching effectiveness on a 5-point scale, with the points anchored by "poor," "fair," "good," "very good" and "excellent." Here the mid-point of "3" is positive rather than neutral. Consequently, the exact same average obtained using a balanced versus an unbalanced scale cannot be assumed to reflect the exact same level of teaching (even assuming perfect reliability and validity).

Results and Discussion

Reliability Analyses. The main goal of the reliability analyses was to attach ranges to average values of students' perceptions of teaching. These ranges represent the upper and lower bounds between which we can expect average ratings to fall, if we repeated our procedures an infinite number of times. It is necessary to examine these ranges rather than actual mean values because means across courses (and items) vary not just due to actual differences in the underlying teaching quality, but also because of random error. The actual size of these ranges depends on two factors--the diversity of student opinion within a course and the number of students in a course. Wider ranges either reflect greater diversity or smaller numbers of students (or both). There are instances in which it is impossible to attach a range to an average value. This occurs when either there is only one student who provides ratings (as did occur in this data set) or when all students provide the exact same numeric response (this also occurred).

Figures 1 through 6 display mean values across students and the associated ranges for all the course data available for analysis. Each range is drawn from the end-of-course data available from one course. Thus, Figure 1 shows that 29 sets of course data were available for the "Course Goals" item. Recall that when ranges do not overlap, their associated mean values are considered statistically different from one another, in this case at the p < .05 level. This means that there is less than a 5 percent chance of mistakenly concluding that the two scores whose ranges do not overlap are different. In other words, on only 5 out of 100 repetitions of the data gathering would we expect a difference this large if the two scores were not different in reality. Careful visual inspection of the ranges pictured in Figure 1 reveals 40 instances where an instructor was rated as significantly clearer in presenting course goals than another instructor. Course 3 was rated significantly higher than courses 4, 7, 9, 10, 11, 12, 14, 26, and 27. Courses 1 and 24 were rated higher than courses 4, 7, 9, 11, 26, and 27. Course 2 was rated higher than courses 7, 9, 26, and 27. Courses 6, 21, and 16 were perceived as better on the "Course Goals" item than courses 7, 9, and 27; courses 28 and 29 were rated higher than courses 7 and 9. Finally, courses 10, 15, and 23 were rated significantly higher than course 7.

These 40 instances of significant differences among the ratings for "Course Goals" may seem impressive, but there were a total of 378 available comparisons. In other words, only 10.58 percent of the possible comparisons yielded differences that would be considered significantly different at conventional levels. Further, the logic of statistics is probabilistic; some of the instances that yielded significantly different results (i.e., nonoverlapping ranges) are likely due to random error. That is, the 5 percent error rate accumulates across multiple statistical tests.
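The arithmetic behind these figures can be checked with a short sketch. The 40 and 378 come from the text; 378 is the number of distinct pairs among 28 ranges, so 28 is used below as an assumption about how many ranges entered the comparison. The example ranges and the helper name are hypothetical and serve only to show how nonoverlapping pairs would be counted.

```python
from itertools import combinations
from math import comb

# Assumption: 28 ranges, the count for which all pairs number 378.
n_ranges = 28
n_comparisons = comb(n_ranges, 2)
n_significant = 40  # nonoverlapping pairs reported for "Course Goals"
percent_significant = 100 * n_significant / n_comparisons
print(n_comparisons, round(percent_significant, 2))  # 378 10.58

def count_nonoverlapping(ranges):
    """Count pairs of (low, high) ranges that share no values."""
    return sum(
        1 for a, b in combinations(ranges, 2)
        if a[1] < b[0] or b[1] < a[0]
    )

# Hypothetical standardized ranges for four courses.
example_ranges = [(-0.2, 0.6), (0.8, 1.4), (-1.5, -0.5), (0.1, 0.9)]
print(count_nonoverlapping(example_ranges))  # 4 of the 6 possible pairs
```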

Figures 2 through 6 display the ranges associated with "Clarity of Presentation" (Figure 2), "Course Encouraged Thought" (Figure 3), "Timeliness of Feedback" (Figure 4), "Instructor Effectiveness" (Figure 5), and "Overall Course Evaluation" (Figure 6). Inspection of the ranges displayed in these figures demonstrates the following numbers and percentages of courses rated significantly differently:

Item                         No. of Sig. Diffs.   No. of Comparisons   % of Sig. Diffs.

Course Goals                 40                   378                  10.58
Clarity of Presentation
Course Encouraged Thought
Timeliness of Feedback
Instructor Effectiveness
Overall Course Evaluation

Across 6 Items

This table indicates that there is variation across course quality items in their ability to reveal differences between faculty. Faculty were more likely to be rated differently in the more abstract qualities of encouraging thought and overall effectiveness, for example, than they were on the more concrete dimensions of clearly presenting course goals and course material.

Validity Analyses. The validity analyses had two goals: to make sure that students' ratings do not vary across variables that might serve as contaminants or confounds, and to see whether they do vary across other variables that should meaningfully relate to the quality of teaching. To address the contaminants issue, average ratings for the six individual items were computed for each grouping of: division in which the course was taught, level of course, format of course, sex of instructor, and the inclusion of a "don't know" option on response scales. Results are depicted in Tables 1 through 5. These tables display the standardized mean values for each grouping, the number of courses included in each grouping, and the value of the statistical tests. There are many average values that appear different, but that are not reliably or statistically different. The row labeled "p-value" indicates the significance level; p-values less than or equal to .05 indicate that the means in that column are significantly different from one another. As before, this means that there is less than a 5 percent chance of mistakenly concluding that these means are different from one another. Inspection of the tables reveals surprisingly few statistically significant differences. Courses in the Social Studies division were rated significantly less likely to encourage thought than courses in the Humanities division (see Table 1) and women faculty were perceived to be significantly more clear in the classroom than men faculty (see Table 4). Students also showed a nonsignificant tendency to rate instructors of Social Studies courses as providing more timely feedback on graded items than instructors of Science courses (see Table 1). Further, 100 level courses were associated with longer feedback times than courses at the other three levels (see Table 2). In summary, few significant differences emerged when considering potential contaminants of students' evaluations of teaching.
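The kind of group comparison summarized by a p-value can be illustrated as follows. The statistical test behind the tables is not named here, so this sketch swaps in a permutation test on the one-way F ratio purely for illustration; the standardized course means, function names, and number of permutations are all hypothetical.

```python
import random
from statistics import mean

def f_statistic(groups):
    """One-way F ratio: between-group over within-group mean squares."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of random regroupings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical standardized course means on one item, grouped by division.
humanities = [0.6, 0.9, 0.4, 1.1, 0.7, 0.8]
science = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
social_studies = [-0.8, -0.5, -1.0, -0.6, -0.9, -0.7]

p_value = permutation_p_value([humanities, science, social_studies])
print(p_value <= 0.05)  # True for these clearly separated groups
```

The permutation approach asks how often a difference as large as the observed one would arise if division labels were assigned to courses at random, which mirrors the interpretation of the p-value row in the tables.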

The number of times an instructor had taught a course (categorized into the following groupings of approximately equal numbers of courses: 1 to 2 times, 3 to 8 times, or 9 to 50 times), the number of students providing ratings for the course, and the number of years instructors had been employed by the college (categorized according to traditional contract status: 1 to 3 years, 4 to 6 years, and 7 or more years) were used to see if students' evaluations vary along these variables. These results are displayed in Tables 6 through 8b. A larger number of students providing ratings was associated with significantly longer grading feedback times (see Table 7). As well, instructors employed by the college for only 1 to 3 years were significantly slower in providing feedback than others who had been employed 4 or more years (see Table 8a). Tenured faculty (those employed 7 or more years) were rated as significantly more effective than those in the first years of their appointments (1 to 3 years--Table 8a, 1 to 6 years--Table 8b). Courses taught by tenured faculty were also rated as encouraging more thought than those taught by nontenured faculty (see Table 8b). Finally, there was a nonsignificant trend for those instructors who had offered a course 9 or more times to be rated as more effective than instructors who had offered a course less frequently (see Table 6). While these analyses were not focused in the sense that there were no specific a priori predictions about how particular variables (e.g., tenured versus not; number of students) would relate to particular teaching quality variables, the results do reveal that our end-of-course evaluation instruments may be discerning meaningful differences among groupings of courses, such as those taught by tenured versus nontenured faculty.

General Discussion

It seems as if we have two goals with end-of-course evaluations--performance evaluation (reappointment, promotion, salary recommendations) and development of teaching. Does our current end-of-course evaluation procedure adequately meet either of these goals? We can begin to address this question by assessing the reliability and validity of the end-of-course evaluation measures. With regard to the reliability of our estimates, there was considerable variation around mean ratings of teaching quality. This variation was due both to differences in opinion among students in the same course and to our small class size. The result of this large variation is that our measures were able to statistically differentiate between extremely low and extremely high average ratings, but not between the lower or higher ratings and more moderate scores.

Discriminant and convergent validity issues were also addressed by examining (1) whether average student ratings differed depending on confounding variables such as division and level of course, and sex of instructor (discriminant validity) and (2) whether ratings differed according to variables we hope to be related to teaching quality (convergent validity). A large number of statistical tests resulted in little evidence of confounding. Specifically, humanities courses were evaluated as significantly more likely to encourage thought than social science courses and women faculty were evaluated as significantly more clear than men in their presentations of course material. These differences should be examined further to see if they are replicable or merely the result of random variation. Other variables that we would hope to be related to teaching quality, such as the number of times a course had been taught by an instructor, the number of students in a course, and the number of years that an instructor had been teaching at Grinnell were in fact significantly or marginally related to average ratings. For instance, overall effectiveness ratings increased when a course had been taught more than eight times by an instructor and a larger number of students was associated with slower evaluative feedback from instructor to students. Finally, courses taught by tenured faculty were perceived by students as encouraging more thought and these faculty were characterized as more effective.

There might be reasons to assume that the sample of course evaluation data included in this study is biased, due to the lack of representativeness. For example, faculty concerned with low evaluations might not have consented to have their data included, despite promises of anonymity. However, random error or reliability is assessed within each course individually, so that the ranges associated with the average of students' perceptions remain unaffected by the number of faculty participating in this study. Nonetheless, a broader sampling of faculty would result in a greater ability to detect differences between average course evaluations if the current sample is biased by including only those faculty who accurately expected high ratings. Thus, our present instruments might be doing a better job of differentiating among individual faculty than these data were able to demonstrate.

With regard to the validity analyses, our two concerns are detecting differences between groupings of courses that do not exist in actuality and not detecting differences that do in fact exist. For the available sample of course data to be biased toward either of these possibilities, it would have been necessary for particular groups of faculty to selectively choose or not choose to participate. For instance, if Science Division faculty expected low evaluations, then a proportionately greater number of science faculty might not have participated in the study. Consequently, these analyses would be limited in their ability to detect a depression in students' ratings of science courses. The distribution of courses represented in the sample analyzed here suggests that division is the one variable on which our data were not representative; proportionately fewer science faculty than those from other divisions participated in this study. However, it is important to guard against overinterpreting divisional representation; two of the five departments in the Science Division employ qualitative instruments that could not be analyzed in this study. It is likely that the underrepresentation of science courses is due to the analysis of quantitative instruments, and not to a selection bias. Furthermore, if faculty who expected low evaluations chose not to participate, and if these faculty were evenly dispersed across divisions, course level, gender of instructor, pedagogical format, etc., then the inclusion of their data would not alter the present conclusions. That is, there is scant evidence that students evaluate one type of course differently from another type of course. This conclusion would not be compromised by including a greater number of poorly evaluated courses of all types.

A broader concern that could have hindered the ability of these analyses to detect meaningful differences in students' evaluation of individual courses, as well as groupings of courses, is the variety of question wordings employed by departments. Some of the variation present in the current data is due to variation in item wording. If this variation is great, and at present there is no way to estimate the size of the wording effect, then both the reliability and validity analyses are less apt to detect differences among individual courses or groupings of courses. In other words, it is practically impossible to detect these differences in students' perceptions when using a flexible yardstick--represented in our case by varying evaluation instruments, items, and rating scales. These differences in the phrasing of items across departments' forms could account for the fact that the validity analyses did not reveal inappropriate discrepancies in students' ratings of particular types of courses.

The measurement issues change a bit when one shifts from the purpose of evaluation to that of development. Though these statistical analyses did not address the efficacy of our current instruments for course or faculty development, there is reason for concern. As one example, the selection of items can be engineered to move average ratings up or down on a quantitative scale. This is unfortunate, as many of the very characteristics useful for course development might work at cross purposes to characteristics useful for evaluation. Thus, items designed to elicit constructive criticism might be most useful for course development but would surely be avoided by the evaluation-savvy faculty member. This suggests that we might be best served by relying on two methods to achieve the dual goals of evaluation and development.

It is difficult to reach any conclusion other than that our current instruments and procedure are inadequate for both the evaluation of faculty and for course development. Though end-of-course evaluations have the potential to provide useful differentiating information about faculty classroom performance as perceived by students, our varied instruments and procedures militate against this. Finally, this document addresses only one method of evaluating instructors' classroom performance. Because no single procedure is without flaws, it is important to keep in mind that we should always rely on multiple sources of information.