NOTE: The Thomas B. Fordham Institute occasionally publishes guest commentaries on its blogs. The views expressed by guest authors do not necessarily reflect those of Fordham.
There is one goal that all Ohioans share: We want our children to have a high-quality education. But pursuing this goal requires that we measure “quality” correctly. Without a valid measure, parents cannot choose the best educational options for their kids, state and local policymakers cannot properly monitor and govern the schools they oversee, and educators cannot improve their instruction. A bad measure of educational quality can lead parents, policymakers, and educators to inadvertently undermine students’ learning and, ultimately, their life chances.
Measuring school quality is tricky. One must somehow disentangle a school’s contribution to student learning from all other factors over which schools have little influence, such as a student’s motivation and ability, parental involvement and resources, and peer influence. Average proficiency rates at the school or district level are largely attributable to these outside influences. Consequently, praising a school for high proficiency rates is akin to praising it for serving affluent families, and disparaging a school for low proficiency rates is akin to faulting it for serving students from poor households. In fact, it is common for schools with low proficiency rates to be of higher quality—in that they impart more knowledge and skills during the school year—than schools with high proficiency rates.
Fortunately, there is a large body of rigorous research that provides clear direction on how to measure school quality. This research indicates that measuring student test score growth from year to year allows one to isolate teacher, school, and district contributions to student learning, or “value added.” For example, one study estimated school value added in this way and sought to determine whether it truly captured the educational effectiveness of those schools. To check, the researcher took advantage of a policy that randomly assigned students to schools. If the value-added estimates were valid, then the randomly assigned students’ test scores in math and reading would improve by the amount predicted by each school’s value-added estimates from the prior year. That is exactly what the author found. Moreover, research shows that students with higher-quality teachers—as measured by value-added achievement-growth estimates—experience superior life outcomes.
The bottom line is worth reiterating: Rigorous research indicates that, if estimated properly, value-added measures capture school contributions to student learning, and such test score gains correspond to better real-world outcomes later in life.
The good news is that Ohio has long incorporated value-added measures in its school accountability system, and research indicates that its underlying value-added calculations have significant strengths. For example, by accounting for multiple prior years of student test scores, Ohio’s value-added estimates should more accurately and precisely capture school quality than estimates that fail to do so. Unfortunately, the way that value-added estimates are incorporated into Ohio’s school and district report cards significantly undermines these strengths—leading to year-to-year volatility in school and district grades, which understandably confuses stakeholders and weakens their confidence. Let’s fix that problem.
Separating student achievement growth from “margin of error”
The statistical model that Ohio uses to calculate school value added is technical, but the underlying logic should make sense to those familiar with political polling. When an opinion survey asks voters which candidate they are likely to support in the next election, pollsters typically report two quantities: (1) a candidate’s lead in percentage points and (2) whether the lead is within the “margin of error.” The second component accounts for the inherent uncertainty in any statistical calculation. For example, with a margin of error of plus or minus five points, a candidate ahead by just two points will not rest on her laurels because she will not have much confidence that her lead is real.
Value added works the same way. For each school and district, the state constructs a measure of achievement growth that tells us whether the average student demonstrated more or less than a year’s worth of learning, as well as a “margin of error” for that estimate. If the calculation indicates that, on average, a school’s students demonstrated more than a year’s worth of learning, but the difference is within the margin of error, we can’t be certain that these students really outperformed their peers attending other campuses.
Ohio’s value-added system relies on both the raw growth calculations (known as “gain scores”) and their associated margins of error. The gain score tells us the difference between the average student’s test score growth and a baseline of expected growth. The margin of error tells us whether the gain score is statistically significant—that is, whether this difference is unlikely to be due to chance. As originally implemented in 2008, when the gain score was within the margin of error—that is, when the difference between student achievement growth and expected growth was not statistically different from zero—Ohio report cards would label schools or districts as having “met” the growth target. Those with negative gain scores beyond the margin of error were labeled as “below” the growth target, and those with positive gain scores beyond the margin of error were deemed “above” the growth target. These designations were meant to identify schools for which we had strong evidence that their students were falling behind—those whose negative “gain scores” were unlikely to be due to random measurement error—and to recognize those posting impressive growth.
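For readers who want to see this logic in concrete terms, the sketch below walks through the original three-category classification. The gain scores, standard errors, and the two-standard-error margin are our own illustrative assumptions, not the state’s actual model.

```python
# Minimal sketch of the original three-category logic. The numbers and the
# two-standard-error margin are hypothetical, not Ohio's actual model.

def classify(gain_score, standard_error, z=2.0):
    """Return "above," "met," or "below" depending on whether the gain
    score falls outside its margin of error (here, z standard errors)."""
    margin_of_error = z * standard_error
    if gain_score > margin_of_error:
        return "above"   # strong evidence of above-expected growth
    if gain_score < -margin_of_error:
        return "below"   # strong evidence of below-expected growth
    return "met"         # the difference could plausibly be due to chance

# A positive gain score still inside the margin of error counts as "met."
print(classify(gain_score=1.5, standard_error=1.0))  # -> "met"
print(classify(gain_score=3.0, standard_error=1.0))  # -> "above"
```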
Although report cards did not indicate what the actual gain scores were (more on that in a bit), this methodology was statistically sound. The value-added metric became somewhat problematic, however, when lawmakers replaced the “met,” “below,” and “above” value-added designations with five categories corresponding to letter grades A through F. They based these letter grades on different levels of statistical significance—essentially, by defining different thresholds for calculating the margin of error. Schools and districts received A’s if the statistical model indicated great confidence that they had gain scores greater than zero (akin to the “above” designation) and B’s if the model indicated less confidence that a school or district exceeded expected gains. D and F designations were based on similar thresholds for schools and districts with negative gain scores, and a C designation went to schools and districts with student achievement growth that was not statistically different from the expected amount.
There are multiple problems with this change. One problem is due to how stakeholders interpret letter grades. The implication is that a school with an A imparts more knowledge than one with a B. But that is not necessarily the case. For example, a school might receive a B simply because it has fewer students—a smaller sample size—than the school that received an A, even if students in the former actually posted higher growth. Another problem is that switching from three to five performance categories meant that the difference in the error thresholds became quite small—particularly for D’s and B’s. Thus, it became likely that, from year to year and by random chance alone, a school or district would bounce from A to C, or from C to F.
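To make the sample-size problem concrete, consider the following sketch. The letter-grade cutoffs and the enrollment and growth figures are hypothetical stand-ins for the scheme described above; the point is simply that the margin of error shrinks as enrollment grows, so a larger school can earn a higher grade on weaker growth.

```python
import math

# Hypothetical A-F scheme keyed to a z-statistic (gain score divided by its
# standard error). The cutoffs are illustrative, not the state's thresholds.
def letter_grade(gain_score, standard_error):
    z = gain_score / standard_error
    if z >= 2.0:
        return "A"
    if z >= 1.0:
        return "B"
    if z > -1.0:
        return "C"
    if z > -2.0:
        return "D"
    return "F"

# The margin of error for average growth shrinks with enrollment (~ 1/sqrt(n)).
def standard_error(score_sd, n_students):
    return score_sd / math.sqrt(n_students)

# Large school: modest growth but a tiny margin of error -> "A"
print(letter_grade(1.0, standard_error(score_sd=10, n_students=900)))  # -> "A"
# Small school: higher growth but a wide margin of error -> "B"
print(letter_grade(2.0, standard_error(score_sd=10, n_students=60)))   # -> "B"
```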
These problems are serious when one considers the stakes associated with report card grades. Note that these issues are not a consequence of the underlying value-added methodology, which remains strong; they stem simply from how the scores it produces are translated into the categories reported on the school and district report cards.
This year’s state budget makes major changes to the grades but retains the essential limitations of the A through F grading, including its reliance on arbitrary statistical significance thresholds. Instead of addressing the root of the problem, the updates simply change how specific margins of error map onto test scores. The changes allow districts and schools to receive higher report card grades with no change in actual performance—for example, allowing those with negative growth to receive a B and giving a C to districts and school buildings that would have received the lowest “below” designation under the old classification system.
Make value added valuable
We have two general recommendations for improving the value-added metric.
First, we recommend that report cards provide the actual gain score that indicates how much achievement growth a school or district’s students experienced relative to what is expected in each grade, as opposed to merely reporting the statistical significance of that growth. That would be far more informative for all stakeholders—such as parents, teachers, administrators, and other interested citizens—and it would put far less weight on arbitrary benchmarks of statistical significance. This would be a more transparent system—particularly if the estimates are translated into a more intuitive metric, such as percentiles.
Second, we recommend that accountability-related assessments of the reported growth estimate be based on large student samples and a high benchmark for statistical significance. These changes should lessen the extent to which measurement error affects stakeholder conclusions about school quality. This option could entail simply going back to the original three-category methodology, in which schools and districts are judged as “above,” “met,” or “below” typical growth based on whether their gain scores fall outside the margin of error.
Ohio could also combine these recommendations. Although this might be too much detail for some readers at this early stage, what we have in mind is integrating the raw gain scores and their margins of error into a single metric, essentially shrinking growth estimates toward zero when they are estimated less precisely. A number of individual school districts already do this for their teacher evaluation systems, and Ohio’s own teacher evaluation system already incorporates one form of such statistical “shrinkage.”
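For the statistically inclined, the basic idea can be sketched in a few lines. The reliability weight below is a standard empirical-Bayes form; the variance figures are invented for illustration and are not drawn from Ohio’s system.

```python
# Illustrative empirical-Bayes-style shrinkage: noisier estimates are pulled
# harder toward zero. All numbers are hypothetical.

def shrink_toward_zero(gain_score, standard_error, true_variance):
    """Weight the raw gain score by its estimated reliability:
    reliability = true_variance / (true_variance + standard_error**2)."""
    reliability = true_variance / (true_variance + standard_error ** 2)
    return reliability * gain_score

# A precisely estimated gain keeps most of its value ...
print(shrink_toward_zero(gain_score=2.0, standard_error=0.5, true_variance=1.0))  # -> 1.6
# ... while the same gain estimated noisily is pulled much closer to zero.
print(shrink_toward_zero(gain_score=2.0, standard_error=2.0, true_variance=1.0))  # -> 0.4
```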
Whatever policymakers decide to do, we recommend reporting the gain score using an intuitive metric (as opposed to merely a letter grade) and linking it to report card grades in a more precise way. For example, in addition to presenting the gain scores in terms of percentiles, report cards could indicate whether schools “met” or “exceeded” expected achievement growth based on a stringent benchmark for statistical significance. In turn, schools that met or exceeded this benchmark could receive an overall report card grade no lower than a C. Such a strategy would provide more useful information (the gain score), minimize volatility in overall report card grades, and avoid the punishment of effective schools that serve disadvantaged students.
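The sketch below shows roughly how such a combined rule might operate. The percentile conversion, the significance benchmark, and the C floor are our illustration of the idea, not a finished proposal or the state’s methodology.

```python
# Hypothetical combined reporting rule: show the gain score as a percentile,
# reserve "met"/"exceeded" for a stringent benchmark, and floor the overall
# grade at C for schools that clear it. All thresholds are illustrative.

def percentile_of(gain_score, all_gain_scores):
    """Report where a school's gain score falls among all schools (0-100)."""
    below = sum(g < gain_score for g in all_gain_scores)
    return round(100 * below / len(all_gain_scores))

def growth_designation(gain_score, standard_error, z_benchmark=2.5):
    """Label growth using a stringent statistical-significance benchmark."""
    z = gain_score / standard_error
    if z >= z_benchmark:
        return "exceeded"
    if z > -z_benchmark:
        return "met"
    return "did not meet"

def floored_overall_grade(proposed_grade, designation):
    """Schools that met or exceeded expected growth get no worse than a C."""
    order = ["F", "D", "C", "B", "A"]
    if designation in ("met", "exceeded"):
        return max(proposed_grade, "C", key=order.index)
    return proposed_grade

print(percentile_of(1.2, [0.5, -0.3, 1.8, 0.9, 2.1]))          # -> 60
print(growth_designation(gain_score=3.0, standard_error=1.0))  # -> "exceeded"
print(floored_overall_grade("D", "exceeded"))                  # -> "C"
```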
To be sure, each approach we have reviewed has its own set of tradeoffs. It’s important that lawmakers make further changes carefully and deliberatively, anticipating the unintended consequences. But it is also clear that a change in the reporting and grading system is needed. Having a value-added system that ignores the actual growth scores and relies on arbitrary statistical thresholds can undermine the efforts of parents, policymakers, and educators who work so hard to deliver a high-quality education to our children. It is rare that rigorous research offers such clear policy guidance, but that is the case with value added. We know that measuring school quality in this way is valid—that it captures school contributions to student learning, and that this learning is predictive of life outcomes. Let’s take advantage of that knowledge.
Vladimir Kogan is an associate professor in The Ohio State University’s Department of Political Science and (by courtesy) the John Glenn College of Public Affairs. Stéphane Lavertu is an associate professor in The Ohio State University’s John Glenn College of Public Affairs. The opinions and recommendations presented in this editorial are those of the authors and do not necessarily represent policy positions or views of the John Glenn College of Public Affairs, the Department of Political Science, or The Ohio State University.