Struggling with the detail?

How do you fairly differentiate teachers into outcome categories?

And how accurate are current models in rating teachers' performance?


Evaluating teacher performance is central to education reform in the United States. How teachers are evaluated, and how added value is measured, is the focus of ongoing debate about fairness, consistency and transparency. ‘Value-added’ measures of performance, the average test-score gains of pupils taught by a given teacher, instructional team, or school, are often the most important outcomes for performance measurement systems that aim to identify instructional staff for special treatment, such as rewards and sanctions.

There is also the thorny issue of how you categorise teachers once you have measured their performance. Do you place them into outcome categories, and if so, how many? For example, are they rated highly effective, effective, developing or ineffective? Many states have already designated four or five categories. Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.

The number of categories a teacher evaluation system employs should, of course, depend on how well it can differentiate teachers' performance with a reasonable degree of accuracy. Given that individual teachers and schools can be subject to significant consequences on the basis of their value-added estimates, researchers have increasingly paid attention to the precision of these estimates. And this is where it becomes problematic. If one model of measuring added value were accepted as doing so with a considerable degree of accuracy over time, and were fair and not subject to random results, the task would be straightforward. But that is not how things stand.
There are, first, random differences across classrooms in unmeasured factors related to test scores, such as pupils' abilities, background factors and other pupil-level influences, and, second, what have been described as ‘idiosyncratic’ unmeasured factors that affect all the students in a specific classroom, such as a barking dog on test day or a particularly disruptive student in the class. Existing research has consistently found that teacher- and school-level averages of student test score gains can be unstable over time. Studies in the States have found only moderate year-to-year correlations, ranging from 0.2 to 0.6, in the value-added estimates of individual teachers (McCaffrey et al. 2009; Goldhaber and Hansen 2008) or small to medium-sized school grade-level teams (Kane and Staiger 2002b). As a result, there are significant annual changes in teacher rankings based on value-added estimates.
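To see what a year-to-year correlation in that range means in practice, here is a small simulation sketch. The numbers are hypothetical, not drawn from any of the studies cited: each teacher is given a stable ‘true’ effect plus independent annual noise, calibrated so that the two years' estimates correlate at about 0.4, the middle of the reported 0.2 to 0.6 range, and teachers are then sorted into five equal-sized categories each year.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000

# Hypothetical setup: each teacher has a stable "true" effect, and each
# year's value-added estimate adds independent classroom-level noise.
# The noise variance is chosen so the year-to-year correlation of the
# estimates is about 0.4: signal share = 1 / (1 + 1.5) = 0.4.
true_effect = rng.normal(0.0, 1.0, n_teachers)
noise_sd = np.sqrt(1.5)
year1 = true_effect + rng.normal(0.0, noise_sd, n_teachers)
year2 = true_effect + rng.normal(0.0, noise_sd, n_teachers)

r = np.corrcoef(year1, year2)[0, 1]

# Place teachers into five equal-sized outcome categories each year,
# then count how many land in a different category the second year.
q1 = np.searchsorted(np.quantile(year1, [0.2, 0.4, 0.6, 0.8]), year1)
q2 = np.searchsorted(np.quantile(year2, [0.2, 0.4, 0.6, 0.8]), year2)
changed = np.mean(q1 != q2)

print(f"year-to-year correlation: {r:.2f}")
print(f"share changing quintile category: {changed:.0%}")
```

In runs of this toy model, a clear majority of teachers move to a different one of the five categories from one year to the next, even though their underlying effectiveness has not changed at all.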

A report for the US Department of Education, ‘Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains’ (July 2011), found that there is ‘evidence that value-added estimates for teacher-level analyses are subject to a considerable degree of random error when based on the amount of data that are typically used in practice for estimation.’ It also said the evidence suggests ‘that more than 90 percent of the variation in student gain scores is due to the variation in student-level factors that are not under control of the teacher’.
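The 90 percent figure can be turned into a rough reliability calculation. The sketch below assumes the report's split (10 percent teacher-level signal, 90 percent student-level noise) and, for simplicity, ignores the classroom-level ‘idiosyncratic’ shocks discussed above, so real single-year reliabilities would be lower still.

```python
# Back-of-the-envelope calculation, assuming the report's figure:
# 90% of gain-score variance is student-level, 10% teacher-level.
teacher_share, student_share = 0.10, 0.90

def reliability(n_students: int) -> float:
    """Reliability of a one-year classroom-average gain score:
    signal variance divided by signal-plus-averaged-noise variance.
    Averaging over n students shrinks the noise variance by a factor of n."""
    return teacher_share / (teacher_share + student_share / n_students)

for n in (10, 25, 50):
    print(n, round(reliability(n), 2))
```

Even on this generous model, a class of 25 pupils gives a single-year reliability of only about 0.74, which helps explain the moderate year-to-year correlations reported above.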

So if this is the case, is it realistic or fair to place teachers into five different outcome categories? It may be possible under existing measurement models to differentiate performance at the top and bottom of the distribution, but are they precise or accurate enough to differentiate clearly between the bulk of teachers in the middle of the distribution? There must be some doubt about this, even when you factor in observation of teachers' work (teacher performance evaluation does not rely entirely on value-added measurement). It is worth repeating what the NFER in the UK said in a paper in 1999, when the debate on added value was really beginning in earnest here: ‘What value added data cannot do is prove anything. Value added evidence is only part of the story of school effectiveness. The notion of a value added measure which tells you – and everyone else – how well your school or department or class is doing, and is also simple to calculate, understand and use, is a non-starter’.


