Paper 98-1 of the Edgeworth Series in Quantitative Behavioral Science

 
What is the minimum necessary class size for
statistically valid statements from teaching evaluations:
Issues of sample size, confidentiality, and the influence
of "abberrant" or outlying responses

 

Bruno Zumbo, Ph.D.
University of Northern British Columbia
 



What is the minimum necessary class size for statistically valid statements from the teaching evaluations? This is a commonly asked question in implementing a teaching evaluation system. I have been asked to respond to this question for our teaching evaluation system at UNBC. This question, as I have interpreted it, implies that there is a minimum class size at which the teaching evaluation results should not be interpreted, and that this minimum can be determined on the grounds of statistical theory. Let me respond to this question.

 A. The question, as posed, cannot be answered

 There may be a minimum class size below which teaching evaluations should not be interpreted, but that minimum cannot be determined on the grounds of statistical theory. My reasoning is that the statistics computed are not used for formal statistical inference. That is, the role the statistics play in the teaching evaluations is to describe (in an accurate and honest manner) the data at hand, and hence does not involve going beyond the data at hand (i.e., inference).

Let me note, however, that there is a type of inference that is occurring in the use of the teaching evaluations but it is not one that involves sample size, per se. Let me try and clarify this matter. Consider for the moment that there are two dimensions of generalization (or inference):

(a) Inference over the dimension of individuals, such that we try to infer from a sub-group of individuals to a larger group. As I state above, this is not a concern in our use of teaching evaluations. If it were, the issue would be one of the exchangeability of the surveyed group relative to the entire class, and therefore not solely of sample size, per se.

(b) Inference over the dimension of items (i.e., questions) to a more general variable. In our case one may argue that this variable is, for the lack of a better term, 'teaching effectiveness'. So the issue here is how can we generalize from the items we have to 'teaching effectiveness'.

The question of class size implies consideration of the former, whereas the latter is an issue of measurement validity (i.e., are we measuring what we purport to measure, and can we legitimately make the desired measurement inferences). The latter form of inference was discussed during the development of the UNBC evaluation form and had the approval of the Faculty Association -- this does not mean the matter is closed, but rather that it is not what the current question involves.

B.  Reworking the question

Returning to the original question, then, what we are interested in statistically is a descriptor (a descriptive statistic) that tells us the general tendency of the respondents -- hence a descriptor that is not unduly influenced by a score that is extreme and unusual. That is, one wants a resistant or sturdy descriptor of the data at hand. Interestingly, the mean (which is used at UNBC as the descriptor) is notoriously lacking in resistance to aberrant data points.

The question of sample size tacitly assumes that we are making an inference, and hence asks at what point we have enough data to make that inference. As I have stated above, this is not an issue with teaching evaluations because there is nothing to infer to -- our purpose is simply to describe the data at hand.

As I imply above, however, this does not mean that the question of minimal sample size is unimportant or unanswerable. Instead, I would argue that the question may be posed as one of protection of students' confidentiality and hence the students' freedom to express their opinion free of being identified.

Having recast the question in this manner leads us to a simple answer. Clearly, a class size of one student does not protect confidentiality. Beyond a single-student class, I do not know what a defensible cut-off for maintaining confidentiality might be. I definitely would not conduct evaluations for a class with only one student enrolled (or, alternatively, when only one student responded and the professor knows who that student was ... e.g., there are three students in a class and only one attends class the day of the evaluation).

C. Minimizing the impact of aberrant or outlying responses

Regardless of the above rationale, many institutions practise a form of pre-screening so that classes below a particular size are not evaluated. For the purposes of this section, let us imagine a scenario in which a class does not participate in the evaluation process unless it has a minimum class size of six.

The current practice of not evaluating courses with fewer than six students appears indefensible. The number six carries neither evaluative nor statistical meaning (nor would any other number). The purpose appears to be to circumvent the perceived problem that small samples can be highly affected by unusual observations (scores), the implication being that if an instructor has more than six students then the estimate of the average rating is more stable.

However, it is well known in the statistical literature that it is not the sample size but the choice of measure of central tendency (e.g., the average or the median) that affects the stability of the descriptive measure. The average is a very sensitive measure of the central tendency of a group of scores, irrespective of sample size. That is, the average is greatly affected by even one unusual observation (outlier). Let us place this in the context of an example. A professor has taught 5 different classes with 15 students in each class. The breakdown of the results is given in Table 1.

Table 1. An example of the insensitivity of the median to aberrant data points.
Class   Excellent (5)   Good (4)   Acceptable (3)   Poor (2)   Very Poor (1)   Mean   Median   5% Trimmed Mean
A       15 (100%)       0 (0%)     0 (0%)           0 (0%)     0 (0%)          5.00   5.00     5.00
B       14 (93%)        0 (0%)     0 (0%)           0 (0%)     1 (7%)          4.73   5.00     4.93
C       13 (87%)        0 (0%)     0 (0%)           0 (0%)     2 (13%)         4.47   5.00     4.63
D       12 (80%)        0 (0%)     0 (0%)           0 (0%)     3 (20%)         4.20   5.00     4.33
E       11 (73%)        0 (0%)     0 (0%)           0 (0%)     4 (27%)         3.93   5.00     4.04
The columns of Table 1 are divided into three parts: the class identification, the breakdown for the evaluation scale, and the summary statistics (mean, median, and 5% trimmed mean). Looking down the mean and median columns, it is evident that the average rating drops as more and more students choose the "Very Poor" category, while the median rating is not altered at all. The example makes clear that a minority of the students (in the most extreme case, 27%) can substantially reduce this professor's average rating. The median rating, by contrast, is a far more stable and representative descriptor of the central tendency (middle or center) of a set of scores. Recall that if one were to arrange a set of scores in increasing magnitude, the median is the middle score (representing the 50th percentile).

Clearly, from Table 1 one can see that the mean is overly sensitive to aberrant responses, whereas the median may lack sensitivity. It could be argued that Classes A and E should not receive the same summary score, yet that is how the median depicts them. The trimmed mean, as a measure of central tendency, seems to better reflect the distribution of scores: it rests between the mean (too sensitive) and the median (too insensitive) in its sensitivity to aberrant scores. In the end, in statistical terms, the median and the 5% trimmed mean are both viable and useful measures. The 5% trimmed mean may be the most useful in our context at UNBC. It shares with the median some insensitivity to outliers, yet it is not so insensitive that it discounts nearly half of the distribution, as the median effectively does. See Lind and Zumbo (1993) for further discussion of robust estimators.
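The summary statistics in Table 1 can be reproduced with a short calculation. The sketch below is in Python; the `trimmed_mean` function is an assumed implementation of the interpolated definition of the trimmed mean (the definition used by packages such as SPSS, in which a fractional number of observations is trimmed from each tail), which reproduces the 5% trimmed means in the table.

```python
from statistics import mean, median

def trimmed_mean(scores, proportion=0.05):
    """Interpolated trimmed mean: trim `proportion` of the observations
    from each tail; when proportion * n is fractional, the boundary
    observations are down-weighted rather than dropped outright."""
    x = sorted(scores)
    n = len(x)
    k = proportion * n      # amount trimmed from each tail
    g = int(k)              # whole observations dropped from each tail
    r = k - g               # fractional weight left on the boundary scores
    middle = sum(x[g + 1 : n - g - 1])
    return ((1 - r) * (x[g] + x[n - g - 1]) + middle) / (n - 2 * k)

# Class B of Table 1: fourteen "Excellent" (5) ratings, one "Very Poor" (1).
class_b = [5] * 14 + [1]
print(round(mean(class_b), 2))          # 4.73
print(median(class_b))                  # 5
print(round(trimmed_mean(class_b), 2))  # 4.93
```

A single "Very Poor" rating pulls the mean down by more than a quarter of a point, leaves the median untouched, and nudges the trimmed mean only slightly, which is the pattern the table displays across classes B through E.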

Table 2. An example of how sample size does not alter the attenuating effect of low scores.
Class size   Excellent (5)   Good (4)   Acceptable (3)   Poor (2)   Very Poor (1)   Mean   Median
3            2 (67%)         0 (0%)     0 (0%)           0 (0%)     1 (33%)         3.67   5.00
6            4 (67%)         0 (0%)     0 (0%)           0 (0%)     2 (33%)         3.67   5.00
30           20 (67%)        0 (0%)     0 (0%)           0 (0%)     10 (33%)        3.67   5.00
60           40 (67%)        0 (0%)     0 (0%)           0 (0%)     20 (33%)        3.67   5.00

Table 2 is similar in format to Table 1; however, in Table 2 the point is self-evident that sample size is not what matters when considering the effect of outliers. Irrespective of the sample size, the median is always 5, whereas the average is attenuated exactly as in Table 1. So long as the proportions of responses are held constant, neither the mean nor the median changes with sample size.
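The pattern in Table 2 is easy to verify directly. The short Python sketch below holds the proportions fixed at two-thirds "Excellent" and one-third "Very Poor" while varying the class size:

```python
from statistics import mean, median

# Table 2 pattern: two-thirds "Excellent" (5), one-third "Very Poor" (1),
# at each of the four class sizes in the table.
for size in (3, 6, 30, 60):
    ratings = [5] * (2 * size // 3) + [1] * (size // 3)
    print(size, round(mean(ratings), 2), median(ratings))
# Every class size yields a mean of 3.67 and a median of 5.
```

Quadrupling or even twenty-fold increasing the class size does nothing to the attenuation of the mean; only the choice of descriptor changes the picture.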

Conclusion

Clearly, then, the question of minimum class size is not answerable as commonly posed. It really is not an issue of sample size. Nor is it simply an issue of minimizing the impact of aberrant responses by imposing a minimum sample size; that issue is one of choosing an appropriate descriptor. As I state above, the answer is the following: a class size of one student does not protect confidentiality. Beyond a single-student class, I do not know what a defensible cut-off for maintaining confidentiality might be. I definitely would not conduct evaluations for a class with only one student enrolled (or, alternatively, when only one student responded and the professor knows who that student was ... e.g., there are three students in a class and only one attends class the day of the evaluation).
 
References

  • Lind, J. C., & Zumbo, B. D. (1993). The continuity principle in psychological research: An introduction to robust statistics. Canadian Psychology, 34, 407-414.