Reflections on AERA’s Warning About Value-Added Models

Written by: David Casalaspi

Primary Source: Green & Write, November 25, 2015

On November 11, the Leadership Council of the American Educational Research Association (AERA), the nation’s largest professional association of education researchers, released a statement cautioning researchers and policymakers against the excessive use of value-added models (VAMs) in high-stakes evaluations of teachers, principals, and educator preparation programs. Coming on the heels of a special issue of AERA’s academic journal Educational Researcher devoted to test-based evaluation methods, the Council argued that policymakers and analysts had begun to overreach, racing to apply VAMs to situations for which they are not always appropriate.

VAMs have become popular in recent years in part because they ostensibly provide greater insight into what students learn over time than do status measures of achievement, such as proficiency levels. While status measures provide only a snapshot of student achievement (as measured against an objective standard of proficiency) and do not take into consideration students’ academic starting points, VAMs are designed to capture students’ academic gains over the course of the year and isolate the effects of individual teachers and schools on student learning.
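To make the distinction concrete, a stylized value-added specification (a generic illustration, not the exact model used by any particular state system) regresses a student’s current test score on his or her prior score, along with a term capturing the contribution of the student’s teacher:

\[ y_{it} = \beta\, y_{i,t-1} + X_{it}\gamma + \theta_{j(i,t)} + \varepsilon_{it} \]

Here \(y_{it}\) is student \(i\)’s score in year \(t\), \(X_{it}\) is a set of student or classroom characteristics, \(\theta_{j(i,t)}\) is the estimated effect of the teacher who taught the student that year, and \(\varepsilon_{it}\) is an error term. A status measure, by contrast, reports only whether \(y_{it}\) clears a fixed proficiency cut score and ignores \(y_{i,t-1}\) altogether.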

Despite their technical sophistication, however, VAMs still provide a circumscribed view of performance, and like all statistical measures, they contain a certain amount of error. Moreover, as evaluators move farther away from the original purpose of the metric – estimating student growth on standardized exams – the risk that these errors will compound grows, increasing the chance that teachers, principals, and educational programs are misclassified for accountability purposes.
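To see why even modest error matters for accountability decisions, consider a small, purely illustrative simulation (the magnitudes below are invented for the sake of the example, not drawn from any actual evaluation system). It shows how noise in value-added estimates can place teachers on the wrong side of a rating cutoff:

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 10_000
# Hypothetical "true" teacher effects, in student standard-deviation units.
true_effect = rng.normal(0.0, 0.10, n_teachers)
# Estimation error of a similar magnitude, plausible for single-year estimates.
noise = rng.normal(0.0, 0.10, n_teachers)
# What the value-added model actually reports.
estimated = true_effect + noise

# Suppose policy flags the bottom quartile of *estimated* effects as "ineffective".
flagged = estimated < np.quantile(estimated, 0.25)
truly_bottom = true_effect < np.quantile(true_effect, 0.25)

# Share of flagged teachers who are not actually in the true bottom quartile.
misclassified = np.mean(~truly_bottom[flagged])
print(f"Flagged teachers outside the true bottom quartile: {misclassified:.1%}")
```

Under these invented assumptions, a sizable fraction of the flagged teachers do not belong to the true bottom quartile at all – exactly the kind of misclassification the Council warns about.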

In recent years, researchers, analysts, and policymakers have sought to apply VAMs to evaluations of teachers, principals, and even teacher preparation programs. According to the Council, this is an overreach because “VAM estimates have not been shown to isolate sufficiently the effectiveness of teachers, principals, or other non-teaching professional staff.” The Council then directed attention to some of the “scientific and technical limitations” of VAMs and outlined eight stringent requirements that should be met before analysts use VAMs to evaluate educators. Among them: VAMs should be used only when the tests meet professional standards of reliability and validity; when they draw on data from a large number of students over many years; when the tests have been shown to be consistent from year to year; and when they are used in conjunction with other measures.

Additionally, the AERA Council insisted that VAMs and the accountability systems in which they are embedded should be made more transparent. Researchers and policymakers using VAMs should make public the rationale for the methods used in their calculations, the estimates of their errors, and the estimated reliability of their measures. The Council concluded by calling for substantial investment in improving VAMs and creating alternative models of student growth.

The Worst Form of Measurement, Except for All the Others

While there has been growing unease of late about the heavy reliance on test-score data in educational evaluation, not everyone received the Council’s statement warmly. Michael Hansen and Dan Goldhaber, two well-regarded economists of education, authored a response to the AERA statement on the Brown Center Chalkboard. After acknowledging the Council’s concerns about imprecision, Hansen and Goldhaber chided the group for being too utopian in its assertions, writing: “One must view the value of any particular performance measure in the context of all other measures, not relative to a nirvana that does not exist.” While VAMs do have many technical limitations, the authors conceded, they nevertheless remain the most valid measures available, and it is important not to “let the perfect be the enemy of the good.”

Photo Courtesy of AJCann.

Hansen and Goldhaber have a point. The current alternative to VAMs is an indisputably inferior status quo. When evaluating teacher performance, for instance, the status quo involves infrequent, perfunctory classroom observations that fail to differentiate the best teachers from the worst. Evaluations of teacher preparation programs are just as deficient, usually entailing nothing more than a review of curricular materials (i.e., course syllabi) and sporadic interviews with professors. Yet these facts still do not settle whether VAMs provide sturdy enough grounds for high-stakes decisions about the pay and job security of educators. Hansen and Goldhaber believe that they do, arguing that research has shown the reliability of VAMs to be comparable to that of performance measures already used in other occupations about as complex as teaching – although they did not provide specific examples.

Is Policy Change on the Horizon? 

The AERA statement coincides with other important testing and evaluation policy developments. The Obama Administration has softened its stance on standardized testing in recent weeks, recommending that testing be limited to no more than 2% of instructional time. And in Congress, a plan to reauthorize No Child Left Behind (NCLB) does not mandate specific student goals or accountability frameworks. Given this policy environment, it is conceivable that the Council’s statement will further slow the decade-long push to evaluate educators based on student test scores.

At the same time, however, there is good reason to believe that the statement will have only a limited impact on the overall trajectory of educational evaluation. Last year, the American Statistical Association expressed similar reservations in a statement about the use of VAMs in high-stakes decisions, citing the imprecision of value-added scores and questioning their ability to provide causal estimates of educator effectiveness. Though that statement also received significant attention at the time, VAM-based evaluation systems continued to proliferate and become institutionalized in federal, state, and local policies.

Additionally, the states and localities that have adopted VAMs as part of their evaluation systems have already expended considerable resources to develop the capacity to collect, store, and analyze student data. Given this commitment of time, money, and political capital, it seems unlikely that those who have already adopted VAMs will significantly change course now. Path dependency is at work, and in the world of public policy, Willy Wonka’s revelation that “You can’t get out backwards. You have to go forwards to go back, so you better press on” appears particularly apropos.

As such, what seems more likely to occur in the wake of the AERA statement are minor tweaks and modifications to evaluation systems as policymakers learn how VAMs operate in practice. Indeed, many of the weaknesses of VAMs may only be discovered once they are implemented in the field and forced to interact with other policies. For instance, as new Common Core-aligned tests take hold, attention will need to be paid to how accurately these exams capture what students know and how well suited they are to VAM-based evaluation. Additionally, as resources shift away from testing in an imminent post-NCLB world, evaluators may be forced to reassess the feasibility and generalizability of some test-based metrics. And in light of recent lawsuits challenging the use of VAMs in teacher evaluation and employment decisions in Washington, DC, New Mexico, Florida, New York, and Houston, policymakers will soon have to determine whether the risk of error and misclassification is worth the political, legal, and financial costs. As these and other experiences accumulate, the limitations of VAMs (both theoretical and practical) will become more apparent, and changes might be made.

In short, the Council’s statement is an important reminder that all stakeholders – policymakers, advocates, researchers, and educators – will need to be involved in developing more thoughtful evaluation frameworks that can produce accurate information about educator performance and inform efforts to improve student learning.

 

Contact David: dwc@msu.edu

Contact Jason: burnsja6@msu.edu

David Casalaspi
David Casalaspi is a third-year student in the Educational Policy Ph.D. Program. Before beginning his graduate studies, he attended the University of Virginia, where he received his B.A. in History and spent his senior year completing a thesis on the rise of federal accountability policy between 1989 and 2002. Additionally, while at UVA, David designed and taught a two-credit seminar for undergraduates on the political history of the American education system and also received some practical experience with policymaking through work with the City Council of Charlottesville, VA. His current research focuses on the politics and history of education, and particularly the way that education rhetoric and issue framing efforts affect the implementation of school reforms.