In 2008, the Ohio Department of Education (ODE) released data showing that more than 80 percent of Ohio schools achieved “below expected growth” in fifth-grade reading. A year later, ODE data showed that 98 percent of schools made “above expected growth” in sixth-grade reading. What’s going on here?
What could possibly have accounted for such a spike in school performance from one year to the next? Well, nothing really, except a flawed system for measuring student progress. What’s disturbing, however, is that administrators and policymakers are using these poorly calculated statistics to make real decisions that impact schools, teachers, and students.
Skewed curves and yo-yos
Although ODE first made value-added data available in 2007, state officials admitted at the time that the methodology needed further refinement. While no consequences were attached to that first year of data, policymakers and educators eagerly awaited the official release of data in year two.
That year’s data (2007-08 – see Figure 1) revealed that school buildings were being classified in a statistically odd way. Some tests showed a disproportionately high number of schools with students achieving above expected yearly growth (green on the chart), while two tests showed disproportionately high numbers of schools falling below expected growth (red).
Figure 1
According to a normal statistical distribution (a bell curve), a high number of schools should achieve “average” growth (yellow), with smaller numbers of schools achieving below expected and above expected growth. The actual numbers are startling because, contrary to expectations, most schools were not average.
But it was the 2008-09 data (Figure 2) that showed just how wild the variation could get. Whereas 84 percent of schools attained below expected growth in fifth-grade reading the prior year, 98 percent of schools suddenly produced above expected results in sixth grade. Further still, not a single school produced below expected results.
Figure 3 shows this “yo-yo effect” of producing below expected gains one year, and then dramatically higher gains the next (or the reverse).
Figure 2
Figure 3
For a single classroom, school, or even district, such dramatic swings are plausible. It is far less plausible, however, that every fifth-grade teacher and student in the state had a very bad school year in 2007-08, or that every sixth-grade teacher and student had a great year in 2008-09. Even if Ohio had a single curricular scope and sequence, along with a single set of instructional materials, you would still not see that kind of uniform swing in student growth; some teachers and their students would still grow more, and some less, than others.
The pattern in these data, then, must be a product of the tests themselves and of the method used to calculate growth, which relies on vertically and horizontally aligned tests for every subject in every year.
One would expect reading achievement in a given year to be roughly the same from grade to grade, unless the state made a coordinated effort to emphasize one subject over another. (Note that math gains over the same period also show a yo-yo effect from year to year, though the differences are not as dramatic as in reading.) In this case, the instruments and methods used are partially to blame for the large differences in gains between these two years. The state is counting on the baseline year (2006-07) to be consistent across tests both horizontally and vertically, which is not dependable for a test this new. Ohio is also relying on the standard deviation from the baseline year. If the measure instead used the standard deviation of each year’s test, the distribution of schools above, at, and below a year’s expected growth would stay consistently proportional.
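To make the distinction concrete, here is a minimal sketch in Python using simulated gain scores rather than any actual ODE data. It shows how judging every year against a fixed baseline-year mean and standard deviation can produce the yo-yo pattern, while re-standardizing against each year’s own distribution keeps the proportions stable. The specific numbers and the one-standard-deviation cut point are illustrative assumptions, not the state’s actual model.

```python
# Illustrative only: simulated school-level gain scores, chosen to mimic the
# pattern described above (not real ODE data or the state's actual model).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean gain scores for ~3,000 school buildings in each year.
# Assume the test scale drifted, so the two years sit on different footings.
baseline = rng.normal(loc=0.0,   scale=15.0, size=3000)  # 2006-07 baseline
year1    = rng.normal(loc=-25.0, scale=10.0, size=3000)  # 2007-08
year2    = rng.normal(loc=35.0,  scale=10.0, size=3000)  # 2008-09

def classify(gains, mean, sd, band=1.0):
    """Share of schools below / at / above 'expected' growth, where
    'expected' means within `band` standard deviations of `mean`."""
    z = (gains - mean) / sd
    return {
        "below": round(float(np.mean(z < -band)), 2),
        "at":    round(float(np.mean(np.abs(z) <= band)), 2),
        "above": round(float(np.mean(z > band)), 2),
    }

# Method A: every year judged against the fixed baseline mean and SD.
# The proportions swing wildly from one year to the next.
for label, gains in [("2007-08", year1), ("2008-09", year2)]:
    print(label, "vs. baseline:", classify(gains, baseline.mean(), baseline.std()))

# Method B: each year standardized against its own mean and SD.
# The proportions stay near 16 / 68 / 16 percent every year.
for label, gains in [("2007-08", year1), ("2008-09", year2)]:
    print(label, "vs. itself:  ", classify(gains, gains.mean(), gains.std()))
```

Under these assumed numbers, Method A labels roughly 84 percent of schools below expected growth one year and roughly 98 percent above it the next, while Method B yields stable proportions both years.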
Implications for accountability policies
These results show schools dramatically alternating between below expected and above expected growth, and there is a danger in making decisions based on data that will simply flip later even though teaching and learning have not changed. Such results should be a red flag that Ohio’s method of value-added analysis is unreliable and might not form a solid foundation for decision-making.
The real danger is that these results are used for high-stakes accountability decisions. Schools with falling achievement scores are spared consequences because their value-added data show them making strong yearly gains. Conversely, schools can be penalized because their value-added data show minimal yearly gains even though they may have strong overall achievement scores. Certainly some of the results are accurate, but there are surely some “false positives” and “false negatives” in the school classifications, and they could lead decision makers to erroneous conclusions about school performance.
Consider the following: one large urban district was confident that, despite a three-year decline in overall achievement scores, the growth measures validated its academic strategies, and administrators planned to move forward expecting that the growth would eventually translate into improved overall student achievement. That may happen, or it may not if the value-added model is statistically flawed. Without the life preserver of the value-added provision of Ohio’s school accountability system, a number of charter schools would have been forced to close in the last year. A number of schools (district and charter) rated Academic Emergency by the state would also have faced sanctions. Bad data may have delayed needed reforms for many Ohio students.
Additionally, seeking above expected growth from every student and every school every year is problematic. The goal should be one year of expected growth for one year of teaching. The exception would be expecting more than one year’s growth from students who are far behind their peers and have received intervention services; value-added measures are precisely the right tool for evaluating the effectiveness of those intervention services. For students on grade level, however, attaining one year of expected growth should not be flagged as a cautionary yellow. It is what should reasonably be expected of students and good teaching.
A proposed fix
The yo-yo effect described here strikes at the credibility of Ohio’s value-added measurements, and the lack of clarity surrounding how to calculate them is problematic. Fortunately, there is an alternative.
A basic z-score analysis (which re-standardizes results yearly based on average growth statewide) is a simpler, more transparent way of measuring growth. For seven years I have used this method to analyze school districts in order to identify areas of strength and weakness, to evaluate gifted and intervention services, and to plan staff professional development initiatives.
There are added advantages to this method. It is based on data publicly available from the state, and it can be done on any laptop with a spreadsheet program like Microsoft Excel. Everyone from district administrators down to individual classroom teachers could see how it operates and use it for their own evaluative purposes. In short, it does away with the skepticism surrounding the proprietary black-box method of determining “expected” gains that the state currently employs.
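As a rough illustration of how simple the calculation is, the sketch below implements the within-year re-standardization in Python. The school names, gain values, and the one-standard-deviation cut point are made up for illustration; the same arithmetic could just as easily live in a spreadsheet.

```python
# A minimal sketch of the z-score (standard-score) approach described above.
# The data, column names, and cut point are illustrative assumptions only.
import pandas as pd

# Hypothetical records: one row per school per year with its mean gain score.
data = pd.DataFrame({
    "school":    ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year":      ["2007-08"] * 4 + ["2008-09"] * 4,
    "mean_gain": [-30.0, -22.0, -18.0, -5.0, 10.0, 16.0, 20.0, 31.0],
})

# Re-standardize within each year: z = (gain - statewide mean) / statewide SD.
data["z"] = data.groupby("year")["mean_gain"].transform(
    lambda g: (g - g.mean()) / g.std()
)

# Classify each school relative to that year's statewide distribution.
def label(z, band=1.0):
    if z > band:
        return "above expected"
    if z < -band:
        return "below expected"
    return "expected"

data["growth_label"] = data["z"].apply(label)
print(data)
```

Because every school is compared with the statewide distribution for that same year, a shift in the test scale from one year to the next cannot, by itself, push nearly all schools into the same category.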
More importantly, a state-standardized growth measure would correct the disproportionate results currently skewing the classification of students and schools. In order to assess growth, there has to be a standard of some kind. (It must be noted that when growth is normed on a statewide scale, it is impossible for a majority of districts to attain the above expected growth label.)
It would be wise for Ohio to redo this year’s value-added classifications with a simple z-score (standard-score) methodology and report the differences to schools for their internal planning. Eventually moving to such a system could also save the state money, since the analysis could be run in-house, and would shore up the credibility of the system among its critics. Most importantly, decision makers – from state policymakers to classroom teachers – would not be receiving information that could lead them to make poor decisions.
by Douglas A. Clay
Dr. Douglas A. Clay currently serves as the assistant director for assessment and accountability of Cleveland State University’s campus of the Reading First-Ohio Center for Professional Development and Technical Assistance for Effective Reading Instruction. He has designed and managed data collection systems as well as professional development in making student data actionable for leadership in the Cleveland Metropolitan School District as director of student assessment. He has extensive experience in multivariate data analysis techniques, teacher effectiveness measures, and evaluation models. Dr. Clay has acted as external evaluator and consultant on numerous federal, state, and regional evaluations of professional development for school improvement efforts. These opinions are Dr. Clay’s and do not necessarily reflect the views and opinions of the Ohio Education Gadfly or the Thomas B. Fordham Institute.