MEASURING TEACHER PERFORMANCE IN THE STATES
Struggling with the detail?
How do you fairly differentiate teachers into outcome categories?
And how accurate are current models in rating teachers' performance?
Evaluating teacher performance is central to education reforms in the United States. How teachers are evaluated, and how their added value is measured, is the focus of ongoing debate about how to ensure fairness, consistency and transparency. "Value-added" measures of performance, the average test-score gains of pupils taught by a given teacher, instructional team, or school, are often the most important outcomes in performance measurement systems that aim to identify instructional staff for special treatment, such as rewards and sanctions.

There is also the thorny issue of how you categorise teachers once you have measured their performance. Do you place them into outcome categories, and if so, how many? For example, are they rated highly effective, effective, developing, ineffective, and so on? Many states have already designated four or five categories. Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.

The number of categories a teacher evaluation system employs should, of course, depend on how well it can differentiate teachers' performance with a reasonable degree of accuracy. Given that individual teachers and schools can be subject to significant consequences on the basis of their value-added estimates, researchers have increasingly paid attention to the precision of these estimates. And this is where it becomes problematic. If it were accepted that one model of measuring value added did so with a considerable degree of accuracy over time, was absolutely fair, and was not subject to random results, then the task would be pretty straightforward. But that is not how things stand. There can be random differences across classrooms in unmeasured factors related to test scores, such as pupils' abilities, background factors and other pupil-level influences, and, secondly, what have been described as "idiosyncratic" unmeasured factors that affect all students in specific classrooms, such as a barking dog on the test day, or a particularly disruptive student in the class on the day.

Existing research has consistently found that teacher- and school-level averages of student test-score gains can be unstable over time. Studies in the States have found only moderate year-to-year correlations, ranging from 0.2 to 0.6, in the value-added estimates of individual teachers (McCaffrey et al. 2009; Goldhaber and Hansen 2008) or small to medium-sized school grade-level teams (Kane and Staiger 2002b). As a result, there are significant annual changes in teacher rankings based on value-added estimates.
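To see why correlations of that size translate into unstable rankings, here is a minimal simulation sketch in Python. The numbers in it, a year-to-year correlation of 0.4 and five equal-sized outcome categories, are illustrative assumptions of mine, not figures taken from the studies cited above:

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 10_000
r = 0.4  # assumed year-to-year correlation, mid-range of the 0.2-0.6 cited

# If each year's estimate = stable teacher effect + independent noise,
# the year-to-year correlation equals the signal share of total variance.
true_effect = rng.normal(0.0, np.sqrt(r), n_teachers)
year1 = true_effect + rng.normal(0.0, np.sqrt(1 - r), n_teachers)
year2 = true_effect + rng.normal(0.0, np.sqrt(1 - r), n_teachers)

def category(scores, n_cats=5):
    """Rank teachers and split them into n_cats equal-sized categories."""
    ranks = scores.argsort().argsort()
    return ranks * n_cats // len(scores)

cat1, cat2 = category(year1), category(year2)

print(f"observed year-to-year correlation: {np.corrcoef(year1, year2)[0, 1]:.2f}")
print(f"same category in both years: {(cat1 == cat2).mean():.0%}")
for c in range(5):
    stay = (cat2[cat1 == c] == c).mean()
    print(f"  category {c}: {stay:.0%} stay put")
```

In this sketch only around a third of teachers land in the same category two years running, and the churn is greatest in the middle categories, where a small amount of noise is enough to push a teacher across a boundary.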
A report for the US Department of Education, "Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains" (July 2011), found "evidence that value-added estimates for teacher-level analyses are subject to a considerable degree of random error when based on the amount of data that are typically used in practice for estimation." It also said that evidence suggests "that more than 90 percent of the variation in student gain scores is due to the variation in student-level factors that are not under control of the teacher".
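It is worth making the arithmetic behind that 90 percent figure concrete. If only a tenth of the variance in gain scores reflects a stable teacher effect, the reliability of a class-average gain depends on how many pupil scores are being averaged. The back-of-the-envelope sketch below treats the student-level variation as independent noise that averages away, which is deliberately optimistic: it ignores the classroom-level "barking dog" shocks described earlier, which hit a whole class at once and do not wash out within a single year.

```python
# Reliability of a class-average gain score, assuming 10% of gain-score
# variance is a stable teacher effect and 90% is independent student-level
# noise (classroom-level shocks are ignored in this sketch).
teacher_var, student_var = 0.10, 0.90

for n_scores in (25, 50, 75):  # roughly one, two, three years of one class
    noise_var = student_var / n_scores   # averaging shrinks student noise
    reliability = teacher_var / (teacher_var + noise_var)
    print(f"{n_scores:3d} pupil scores -> reliability {reliability:.2f}")
```

Even on these generous assumptions, a single year of one class leaves a good deal of noise in the estimate, and adding the classroom-level factors back in is what produces the "considerable degree of random error" the report describes.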
So if this is the case, is it realistic or fair to place teachers into five different outcome categories? It may be possible under existing measurement models to differentiate performance at the top and bottom of the distribution, but are they precise or accurate enough to differentiate clearly between the bulk of teachers in the middle of the distribution? There must be some doubt about this, even if you factor in "observation" of teachers' work (teacher performance evaluation doesn't rely entirely on value-added measurement). It is worth repeating what the NFER in the UK said in a paper in 1999, when the debate on added value was really beginning in earnest here: "What value added data cannot do is prove anything. Value added evidence is only part of the story of school effectiveness. The notion of a value added measure which tells you – and everyone else – how well your school or department or class is doing, and is also simple to calculate, understand and use, is a non-starter".