Written by: Jason Burns
Primary Source: Green & Write
Image courtesy of Wikimedia Commons.
Though the Every Student Succeeds Act (ESSA), the successor to No Child Left Behind (NCLB), has a reduced emphasis on accountability, the debate over how to evaluate teachers rages on. While there is little disagreement over the importance of providing all students with an effective teacher, there are markedly different views of how to differentiate between stellar, average, and poor teachers. To date, much of the evaluation discussion has focused on the use of test-based measures like value-added modeling (VAM), namely on their potential shortcomings. However, recent research points out that some of the same issues that characterize VAM also apply to teacher observations, which are the largest component of most teacher evaluation systems. This suggests that in addition to seeking out issues with new approaches, we would also be wise to examine the system we have now.
Briefly, value-added modeling (VAM) estimates how much a teacher has helped individual students grow over the course of an academic year. It is a statistical technique that uses student information, such as background characteristics and prior achievement, to predict later achievement, namely the current-year test score. If the students assigned to a particular teacher exceed their predicted level of achievement, the gain is attributed to the teacher’s skill (see here for an overview). In theory, these teacher-attributed gains can then be used to compare the impact that teachers have on their students’ learning. This has made VAM appealing to education reformers, who see it as a tool for improving teacher quality by tying it to teacher evaluations. However, research has identified several issues with VAM.
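To make the prediction-and-residual idea concrete, here is a minimal toy sketch in Python. It is not any operational VAM (real models use many covariates, multiple years of scores, and statistical adjustments such as shrinkage); it simply predicts each student’s current score from the prior-year score alone and averages each teacher’s residuals. All names and data below are hypothetical.

```python
# Toy illustration of the value-added idea: predict current-year scores from
# prior-year scores with simple OLS, then treat each teacher's average
# residual (actual minus predicted) as a crude "value-added" estimate.

def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def value_added(students):
    """students: list of (teacher, prior_score, current_score) tuples."""
    xs = [prior for _, prior, _ in students]
    ys = [current for _, _, current in students]
    a, b = ols_fit(xs, ys)
    residuals = {}
    for teacher, prior, current in students:
        # Residual = how far the student landed above/below the prediction.
        residuals.setdefault(teacher, []).append(current - (a + b * prior))
    return {t: sum(r) / len(r) for t, r in residuals.items()}

# Hypothetical data: teacher A's students beat their predicted scores,
# teacher B's students fall short by a similar margin.
students = [
    ("A", 50, 62), ("A", 70, 80), ("A", 60, 71),
    ("B", 55, 58), ("B", 75, 78), ("B", 65, 68),
]
va = value_added(students)  # va["A"] is positive, va["B"] negative
```

Even this toy version hints at the debates discussed below: the estimates depend entirely on which predictors go into the model, and with few students per teacher the averages are noisy.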
One issue is that VAMs are often “noisy” (see here and here). VAM scores can vary significantly depending on which statistical model is used, meaning that it can be difficult to make meaningful distinctions between most teachers. With VAM, one can identify the very best and worst teachers, but it is hard to distinguish among teachers in the middle of the teacher quality distribution.
Research has also pointed out that VAMs can be “unstable,” meaning that teachers ranked near the top in one year may be rated significantly worse in later years even when they don’t significantly change their practice.
Third, VAM ratings appear to be influenced by how students are assigned to teachers. In other words, teachers’ VAM scores may be influenced by the types of students assigned to them instead of only by their teaching ability.
Because of these issues, many worry that VAM could mislabel effective teachers as ineffective and therefore oppose its use in teacher evaluations. This is a valid concern, but these issues must also be weighed against the existing evaluation system, and research is beginning to inform this comparison.
Teacher observations are generally done by a building administrator, such as the school principal, and involve that person observing a teacher deliver instruction for some period of time. During these observations, the administrator typically uses a predetermined rubric to score the teacher’s performance on facets of teaching such as classroom management, content knowledge, and student engagement, among others. In theory, examining teachers’ conduct in their classrooms permits an evaluation that considers all aspects of their work and performance. However, observations may have some of the same flaws as VAMs.
Like VAMs, observations also tend to obscure differences between teachers. A report by The New Teacher Project famously found that over 98% of teachers are rated as effective and that only 1% are given an unsatisfactory rating. In other words, most evaluation systems consider the vast majority of teachers to have a similar level of effectiveness, which does not align with the observed variation in teacher quality.
Interestingly, new research shows that observation scores are also influenced by student sorting, that is, by how students are assigned to teachers. In a recent study, researchers found that teachers received more favorable observation scores if their students had higher achievement in the prior year. This is problematic: how students performed in earlier years should not be related to how their teacher is rated in the current year, and the presence of this relationship means that teachers may be undeservedly rewarded for teaching higher-achieving students.
These issues with observations are also cause for concern. It has long been documented that teachers have an important impact on their students and that there are real differences in teacher quality. Unfortunately, though, observations, which comprise the bulk of most teachers’ evaluations, fail to capture the important differences that exist across teachers.
Teacher evaluation, done right, can serve a valuable purpose by providing data that can be used to help teachers improve their practice, to reward or recognize the best teachers, and to inform staffing decisions. However, as highlighted by the research in this post, there is no simple answer to the question of how to accurately and fairly evaluate teachers.
Ultimately, it seems to come down to which type of error one is more comfortable with. On the one hand, approaches like VAM can be used to rank teachers, but in most cases they don’t provide the confidence needed to make consequential decisions. On the other hand, administrator observations avoid comparisons that have little meaning, but fail to differentiate at all among the vast majority of teachers.
Though they may be disheartening, these issues do not rule out progress. In fact, this debate may help evolve our understanding of teacher evaluation methods. As research clarifies the issues associated with different approaches to teacher evaluation, policymakers at the state and local levels should weigh the benefits and limitations of each approach to establish an evaluation framework that provides meaningful data about teachers’ performance while minimizing the risk of mislabeling teachers.