The summer issue of the first-rate journal Education Finance and Policy examines whether principals really believe that all teachers are effective, especially since we know from prior studies that upwards of 98 percent of teachers receive positive evaluations. Supplementing 2012 administrative data from Miami-Dade, the fourth-largest district in the U.S., Jason Grissom and Susanna Loeb asked roughly one hundred principals to rate a random handful of their teachers on several dimensions of practice. Importantly, they let the principals know that these ratings were low stakes, in that only the researchers would see the scores they gave. The hypothesis was that, with no stakes attached, principals might give more candid appraisals. These ratings were later compared to the high-stakes, summative personnel ratings (under the Instructional Performance Evaluation and Growth System, or IPEGS) that the same principals gave those same teachers a few weeks later.
Analysts found that both sets of evaluations were quite positive, but the low-stakes ratings tended to be lower. Indeed, many teachers who were rated “ineffective” on the low-stakes measures received “effective” or “highly effective” ratings on the high-stakes measures. Still, even though the official ratings skewed high, teachers receiving the highest of the high scores (“highly effective” versus merely “effective”) were indeed more effective, as measured by student achievement growth. Analysts also found that principals systematically gave better high-stakes ratings than their low-stakes ratings would predict to beginning teachers, and worse-than-predicted ratings to teachers who were absent more and, in some cases, to teachers of color, though they can’t say why.
In short, the study shows that principals can indeed distinguish between higher- and lower-performing teachers, even when they differentiate only at the high end of the scale. Principals also appear to face strong pressures to skew ratings upward in high-stakes settings. Making more use of the lower rating categories, the analysts recommend, would give teachers more accurate feedback and potentially provide greater incentive for low performers to improve. We’ve certainly seen in Washington, D.C., for instance, how evaluations can also make it more likely that struggling teachers exit the system.
Still, just as prior research has shown, we’ve got to pay attention to the mechanics of evaluations. For instance, the six-point scale on the low-stakes measure may have made principals more comfortable handing out lower ratings than they were with the four-point scale on the high-stakes instrument. Moreover, in the end, we have to trust principals to do their jobs, while not being naïve about the logistical and relational realities involved. Recognizing that low ratings trigger more paperwork and headaches, and that a low-performing teacher likely can’t be terminated anyway, helps explain why a principal might say that one teacher is magnificent while another is just terrific.
SOURCE: Jason A. Grissom and Susanna Loeb, “Assessing Principals’ Assessments: Subjective Evaluations of Teacher Effectiveness in Low- and High-Stakes Environments,” Education Finance and Policy 12, no. 3 (Summer 2017).