Four approaches to ESSA accountability
By Michael J. Petrilli and Brandon L. Wright
Though it sometimes appears that Education Secretary John King didn’t get the memo, the Every Student Succeeds Act (ESSA) represents a significant devolution of authority from the federal government to the states. This is a praiseworthy development that, in our view, better fits America’s constitutional principles of federalism and opens up many areas of education policy for innovation and improvement.
That devolution includes the heart of ESSA: school-level accountability. States now enjoy a freer hand to decide how they want to rate (or “grade”) their schools and determine which are worthy of either praise or aggressive intervention. The new law doesn’t give states carte blanche; they can’t move away from student achievement as a major indicator of quality, for example. But they certainly have more leeway than under No Child Left Behind.
So what forms might—and should—this take? How might states approach the particular challenge of redesigning their accountability systems? The contestants in our “accountability design competition” in February surfaced ideas aplenty and made many promising suggestions. After a few months of reflection on them, we see competing camps or worldviews forming around ESSA accountability, much as they have around school choice. Four such factions are emerging. Let’s identify them by their slogans:
1. Every School is A-OK!
2. Attack the Algorithms
3. Living in the Scholars’ Paradise
4. NCLB Was Extended, Not Ended
Let’s take a look.
Every School is A-OK!
Proponents of this model—the teachers’ unions and other educator groups—fundamentally abhor results-based accountability. They hate it when state officials give their schools black eyes or low marks for not meeting targets that they view as arbitrary and beyond their control. They’d rather get rid of testing and accountability altogether, but since they can’t quite pull that off, they want to at least create a system that depicts schools in the best possible light. Look for them to push for systems in which schools could get good ratings for either high proficiency rates or strong growth; to embrace squishy “other indicators of student success or school quality” (such as “teacher engagement”) and make those indicators count for as much as possible; and to lobby for school categories that all sound positive. (Nebraska’s performance levels might be a model: Cornhusker schools can earn ratings of Excellent, Great, Good, or Needs Improvement.)
Attack the Algorithms
This approach is also skeptical of test scores, but not of accountability per se. It seeks a system that uses as much human judgment as possible and captures a full, vivid, multifaceted picture of school quality. (Indeed, it resembles what’s called “qualitative research” at AERA conferences.) At its core is the school inspection: Experts visit schools to conduct stakeholder interviews, observe classrooms, administer surveys, and more. The results of such inspections would count for as much as ESSA allows. Systems like this are rare in this country, but they’re a significant part of education accountability in some European countries and much of the British Commonwealth. And they aren’t unlike what the best charter authorizers already do when their schools are up for renewal.
Living in the Scholars' Paradise
This approach uses sophisticated, rigorous models to evaluate schools’ impact on student achievement, making sure not to conflate factors (like student demographics or prior achievement) that are outside of schools’ control. It also seeks to avoid the perverse incentives that were baked into NCLB, especially a narrow focus on “bubble kids” just above or below the proficiency line. This would result in a system that is maximally fair and that encourages schools to help all students make as much progress as possible over the course of the school year. The Scholars’ Paradise model would use “scale scores” or a “performance index” for the “academic achievement” indicator; measure growth using a two-step value-added metric; pick robust “indicators of student success or school quality,” such as chronic absenteeism; and make value added count the most in a school’s final score. (If you think that sounds a lot like the model proposed by uber-scholar Morgan Polikoff and his colleagues at our ESSA Accountability Design Competition, you are not mistaken.)
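For readers who want a concrete picture of the growth piece, here is a minimal sketch of one common two-step value-added approach (in Python, with hypothetical column names and toy data): first predict each student’s current score from prior achievement and demographics, then average the residuals by school. It illustrates the general technique only, not the specific metric Polikoff and his colleagues proposed.

```python
# A minimal two-step value-added sketch. Column names and data are
# hypothetical; real models add more controls, multiple years, and shrinkage.
import pandas as pd
import statsmodels.formula.api as smf

students = pd.DataFrame({
    "school_id":  ["A", "A", "B", "B", "C", "C"],
    "score_2016": [410, 395, 430, 445, 400, 420],
    "score_2015": [400, 390, 420, 430, 405, 410],
    "frpl":       [1, 1, 0, 0, 1, 0],  # free/reduced-price lunch flag
})

# Step 1: predict each student's current score from prior score and demographics.
model = smf.ols("score_2016 ~ score_2015 + frpl", data=students).fit()
students["residual"] = students["score_2016"] - model.predict(students)

# Step 2: a school's value-added estimate is its students' average residual,
# i.e., how much better (or worse) they did than the model predicted.
value_added = students.groupby("school_id")["residual"].mean()
print(value_added.sort_values(ascending=False))
```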
NCLB Extended, Not Ended
NCLB is gone but not forgotten. Or maybe it’s not exactly gone, in the minds of folks who yearn for Uncle Sam to mandate accountability models that obsess about achievement gaps and give failing grades to any school with low proficiency rates for any subgroups. Under the NCLB Extended approach, embraced by many on the education reform/civil rights Left, achievement would continue to be measured by proficiency rates alone (with rising annual goals for what is good enough); growth data would be used sparingly and/or focused on “growth to proficiency”; “other indicators of student success or school quality” would be minimized; and evidence of achievement gaps would sink schools’ ratings significantly. NCLB rides again.
***
In our view, Scholars’ Paradise has a lot going for it. Its focus on fairness should mean greater buy-in from educators; its ability to differentiate between high-growth and low-growth schools makes it effective at signaling to policy makers which campuses deserve praise and which need major overhauls. And it focuses equally on all kids regardless of their achievement levels. That also seems like the fairest and smartest approach to us. Yes, it’s wonky—maybe too wonky to be understood by most parents, educators, and policy makers. But we don’t know how our smart phones work either; that doesn’t mean we don’t love them.
Unfortunately, John King’s proposed regulations would make parts of this model illegal. That’s because our friends at the Department of Education read ESSA’s language to mean that proficiency rates—and proficiency rates alone—must be the sole measure of “academic achievement.” We believe that the department’s famously smart lawyers could find plenty of wiggle room and allow states to use an Ohio-style index to give partial credit to schools for getting kids to basic (and additional credit for getting them to advanced). Here’s hoping they wiggle before these regulations are finalized.
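To make the partial-credit idea concrete, here is a minimal sketch of how an Ohio-style performance index can be computed. The point values below are hypothetical (each state sets its own): schools earn some credit for students reaching basic, full credit for proficient, and a bonus for advanced, so two schools with identical proficiency rates can earn different scores.

```python
# Hypothetical point values for an Ohio-style performance index: partial
# credit for "basic," full credit for "proficient," a bonus for "advanced."
WEIGHTS = {"below_basic": 0.0, "basic": 0.6, "proficient": 1.0, "advanced": 1.2}

def performance_index(level_counts):
    """Weighted share of students at each achievement level, scaled to 100."""
    total = sum(level_counts.values())
    points = sum(WEIGHTS[level] * n for level, n in level_counts.items())
    return 100 * points / total

# Two schools with identical 50 percent proficiency rates earn different
# index scores, because the index also rewards moving students out of the
# bottom level and into the top one.
school_x = {"below_basic": 30, "basic": 20, "proficient": 40, "advanced": 10}
school_y = {"below_basic": 10, "basic": 40, "proficient": 30, "advanced": 20}
print(performance_index(school_x))  # 64.0
print(performance_index(school_y))  # 78.0
```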
Attack the Algorithms also shows promise. This approach puts stock in people’s ability to identify and adjust for nuances in ways that quantitative models can’t. For it to work, however, states would have to ensure that inspectors are thoroughly trained, highly competent, impeccably impartial, and willing to differentiate between high- and low-performing schools. To be permissible under ESSA, they would also need to report findings for schools’ subgroups, not just the schools as a whole. But if we all agree that it’s insane to measure teachers based on test scores alone, why should we keep doing that for schools?
You can probably tell by now that we’re not so bullish on the other two approaches. “Every School is A-OK” is simply not truthful. Schools with low growth and low achievement, year after year, are not OK; they are brain-dead and in need of resuscitation or euthanasia. Mediocre suburban schools are not OK, and their “stakeholders” should know it. It’s clear that Secretary King and colleagues feel the same way; they are now working via regulation to bar this model.
On the other end of the spectrum, NCLB Extended would amplify the many problems that ESSA was meant to overcome: the utopian expectations, the sense that we’re setting up every school to fail, the narrow-minded focus on reading and math scores and kids just below or above the “proficiency” line. Turn the page, people. Turn the page.
***
The first state plans for implementing ESSA accountability are due in March. In the meantime, everyone should start designing their ESSA Accountability Camp t-shirts; the tug-of-war is about to begin!
My friend Tom Loveless is right about most things, and he’s certainly right that scoring “proficient” on NAEP has nothing to do with being “on grade level.” He’s also right that Campbell Brown missed this point.
But Tom, alas, is quite wrong about the value of NAEP’s trio of “achievement levels” (basic, proficient, advanced). And he’s worse than wrong to get into any sort of defense of “grade level,” as if that concept had any valid meaning or true value for education reform.
In his words, Tom’s post sought “to convince readers of two things: One, proficient on NAEP does not mean grade-level performance. It’s significantly above that. Two, using NAEP’s proficient level as a basis for education policy is a bad idea.”
We agree on the first point, not on the second—and not on his implicit argument that there is merit in basing education policy on “grade-level” thinking.
Unless one is talking about academic standards—Common Core or otherwise—or about the cut scores on high-stakes, end-of-year, criterion-referenced exams like PARCC and Smarter Balanced, “grade level” has no meaning at all. It’s a misnomer that we adopted during decades of using norm-referenced tests. These were “normed” such that the average score of, say, fifth graders taking the test was deemed to be “fifth-grade-level” work. But that score was simply the average achieved by kids enrolled in fifth grade. It had nothing to do with whether they had mastered the fifth-grade curriculum, had attained a fifth-grade standard, or were well prepared for academic success in sixth grade. It’s nothing more than the score attained by the average student.
That bit of folly led to decades of profoundly misleading reporting of academic performance, memorably skewered by a West Virginia psychiatrist named John J. Cannell in a 1987 study that swiftly became known as the “Lake Wobegon Report” (after Garrison Keillor’s mythical town where “all the women are strong, all the men are good-looking, and all the children are above average”). District after district reported to the public that most of its pupils were scoring “above average” or “above grade level”; on those tests, the two meant exactly the same thing. (And it is mathematically impossible for either to be true nearly everywhere at once.)
Standards are different. They’re aspirational statements of what students in a given grade should learn, even though we’re painfully aware that most don’t—at least not when the standards are rigorous and demanding enough that those who master them are truly on track for success in college or the job market.
As for cut scores: If correctly calibrated to signify readiness for academic success in the following grade—such as a 4 on the PARCC exam—they signify that the test taker has done “grade-level” work, properly understood.
But the percentage of American students getting a 4 or higher on PARCC is about the same as the percentage reaching “proficient” on NAEP (in the grades where NAEP is given)—which is to say, not nearly enough. In Maryland, for example, the 2015 PARCC scores generally showed 30–40 percent of students reaching levels 4 or 5. About the same percentages of the state’s fourth and eighth graders tested “at or above proficient” on NAEP that year in both reading and math.
Which brings us back to NAEP. When the achievement levels were established in the early 1990s—I was on the National Assessment Governing Board (NAGB) then—state leaders and others were hungry for an answer to “How good is good enough?” on various gauges of student and school performance. The authors of A Nation at Risk had to rely on norm-referenced test results and SAT scores to form their bleak conclusions about the parlous condition of American education. Meeting in Charlottesville six years later, the governors and President Bush 41 set “national goals” for American education by the year 2000. One of those goals ambitiously declared that, by century’s end, “American students will leave grades four, eight, and twelve having demonstrated competency in challenging subject matter including English, mathematics, science, history and geography.”
But who was to say what “competency in challenging subject matter” meant, or how to gauge student progress toward it?
Using new statutory authority conferred in 1988, NAGB resolved to try to answer that question by establishing “achievement levels.” These benchmarks would enable NAEP results to be reported according to how well students (and states, etc.) were actually doing, rather than in relation to “scale scores” that have meaning only to psychometricians.
We agonized over how many levels there would be and what to call them, eventually settling on three and boldly declaring that the middle level—“proficient”—was the desired level of educational performance for young Americans.
Yes, it was aspirational (just like Common Core and scores of 4 or 5 on PARCC!). Yes, we knew that most young Americans weren’t there yet. Yes, the achievement levels were destined to be controversial—Loveless summarizes that history. Seems it’s not possible for “experts” at places like the National Academy of Sciences to countenance anything that is ultimately based on human judgment rather than some sort of experiment or regression.
It’s also a fact that many people thought (and still think) that NAEP’s achievement levels, especially “proficient,” expect too much from American schools and students. But, guess what? Recent painstaking research has shown that proficiency in twelfth-grade reading on NAEP equates pretty closely to college readiness. (The corresponding math score is closer to proficient than to basic.) Tom seems to think that the complaints about NAEP’s difficulty are borne out by the fact that even high-performing Asian countries boost only 60–70 percent of their kids to the TIMSS equivalent of “NAEP proficient” in eighth-grade math. To which I reply: For the United States to reach that point would be nothing short of transformational.
If NAEP’s achievement levels are too ambitious for American students, then we have further evidence (as if any were needed) that today’s actual performance by the majority of those students is a long, long way from where it needs to be if we’re at all serious about their readiness for college-level work.
For more than two decades now, NAEP achievement levels have been the closest thing America has had to “national standards.” Yes, they’re ambitious—at least “proficient” and “advanced” are—but so is the goal of getting young Americans prepared for success in college and career.
One presumes that Tom Loveless, former teacher and great guy that he is, shares that aspiration. If so, he should quit knocking the achievement levels. (They already have plenty of wrong-headed critics without him joining the chorus.) And he should explain to Campbell Brown and others that “grade level,” as commonly used, is a hollow and meaningless metric.
On this week’s podcast, Mike Petrilli and Alyssa Schwenk discuss school segregation, four directions states can take under ESSA, and what Hamilton says about grit and privilege. During the Research Minute, David Griffith summarizes ACT’s national curriculum survey.
"ACT National Curriculum Survey 2016," ACT (June 2016).
The Every Student Succeeds Act (ESSA) requires states to incorporate at least one non-academic indicator—which might include (but isn’t limited to) factors like school climate or safety—into their accountability frameworks. That makes this study, published in Educational Researcher, well-timed. The authors set out to test the theory that reductions in school violence and/or improvements to school climate would lead to improved academic outcomes. Instead, the evidence they discovered suggests that the relationship flows in the opposite direction: A school’s improvement in academic performance led to reductions in violence and improved climate—not the other way around.
The authors found serious gaps in prior studies of school climate and safety, many of which illustrated only correlation (not causation) among the variables examined. This motivated them to test the assumption that improved school climate must come first in the chicken-egg scenario. Using six years of student survey results (2007–13) from a representative sample of 3,100 California middle and high schools, analysts employed a research design known for its ability to test causality when large-scale experimental designs aren’t possible. (For the curious, this is described as a “cross-lagged panel autoregressive modeling design,” which determines whether variables at different points in time are correlated with or impact one another.) They looked at three waves of survey data based on student reporting of school violence and school climate, along with schools’ academic performance (as measured by California’s academic performance index). Controlling for each variable’s relationship to the others, the analysts examined whether gains in one time period would lead to improvements in another.
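For readers curious about the method, here is a heavily simplified two-wave sketch of the cross-lagged idea, using made-up data and column names (the published study uses three waves and a full structural model). Each time-2 outcome is regressed on both time-1 variables, so the “academics leads to climate” path can be compared with the “climate leads to academics” path.

```python
# Simplified two-wave cross-lagged sketch (hypothetical data and column names).
# The study itself fits a three-wave structural model; this shows only the
# core logic: each time-2 outcome is predicted from both time-1 variables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
academics_t1 = rng.normal(size=n)
climate_t1 = 0.5 * academics_t1 + rng.normal(size=n)
# In this toy data, earlier academics drives later climate, but not vice versa.
academics_t2 = 0.8 * academics_t1 + rng.normal(scale=0.5, size=n)
climate_t2 = 0.4 * climate_t1 + 0.4 * academics_t1 + rng.normal(scale=0.5, size=n)

schools = pd.DataFrame({
    "academics_t1": academics_t1, "climate_t1": climate_t1,
    "academics_t2": academics_t2, "climate_t2": climate_t2,
})

# Path 1: does earlier climate predict later academics, controlling for earlier academics?
m_acad = smf.ols("academics_t2 ~ academics_t1 + climate_t1", data=schools).fit()
# Path 2: does earlier academics predict later climate, controlling for earlier climate?
m_clim = smf.ols("climate_t2 ~ climate_t1 + academics_t1", data=schools).fit()

print(m_acad.params)  # climate_t1 coefficient should be near zero
print(m_clim.params)  # academics_t1 coefficient should be clearly positive
```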
Not surprisingly, the present study confirms that school violence and climate are closely associated. Like past studies, it also confirms that low levels of violence and positive school climates are associated with high levels of school performance. But the characteristics of a safe and positive school aren’t necessarily a prerequisite for higher achievement. Researchers found that higher school performance in the first wave of data (2007–09) led to both lower school violence and higher school climate ratings in the second wave of data (2009–11). This pattern remained true for the third wave of data (2011–13). Meanwhile, they found no evidence that reducing violence or improving school climate led to increased academic performance across the time periods studied. (They hypothesized, however, that schools undertaking academic improvements might automatically include “issues of climate and victimization” as part of their reform efforts.)
The researchers conclude that academic improvement is “a central factor in reducing violence and enhancing a school’s climate.” To explain the findings, they noted that teachers who hold high expectations for students academically may have more positive relationships with them generally. In addition, one can imagine that improved teaching contributes to a more positive school culture overall. For example—as any teacher can attest—better instruction diminishes time spent off task, as well as the misbehavior associated with it.
Without further study, however, it’s difficult to know exactly how improved academic outcomes foster better climates and lower violence in the schools studied—or to what extent it was better teaching and school leadership that drove the school improvement to begin with. One is also left wondering how much academic achievement can be boosted in a school with a negative culture and unsafe corridors. Still, this is an interesting study that lends credence to the idea that school improvement efforts must focus on academic outcomes as much as—or at least concurrent with—attempts to improve climate and safety.
SOURCE: Rami Benbenishty, Ron Avi Astor, Ilan Roziner, and Stephanie L. Wrabel, “Testing the Causal Links Between School Climate, School Violence, and School Academic Performance: A Cross-Lagged Panel Autoregressive Model,” Educational Researcher (April 2016).
In this survey, ACT asked thousands of K–12 teachers, college instructors, and workforce supervisors and employees about their views on current educational practices and “college and career readiness expectations.” According to ACT, these expectations rightly include not only “core academic skills” in English, reading, mathematics, and science, but also “cross-cutting capabilities” like technological literacy and collaborative problem solving, “behavioral skills” related to self-regulation, and “education and career navigation skills.” (No one could accuse the organization of having a narrow perspective.)
Overall, survey respondents identified “acting honestly” and “sustaining effort” as the most important “non-academic characteristics” for young people to develop. And in a separate set of questions, “content knowledge” and “conscientiousness” were ranked highly by every group, from elementary school teachers to workplace supervisors. However, two skill areas were ranked highly only by workforce respondents: technology (by employees) and collaboration with peers (by supervisors).
Based on these results, the authors recommend that state and local education agencies track the development of students’ non-academic skills and incorporate them into instruction. They also suggest that states and districts invest in technology training for teachers. Both suggestions might be sensible in a world of perfect information and implementation, but as matters stand, they seem to rest on a rather thin reed.
On a tangentially related note, the survey also asked respondents for their views on the Common Core State Standards, and those views turned out to be decidedly mixed. Most notably, among high school teachers who said they were familiar with the standards, 42 percent said the standards are either “a great deal” or “completely” aligned with college instructors’ expectations regarding college readiness (which is, let’s acknowledge, a rather high bar). Similarly, 40 percent of college instructors reported that degree of alignment between the Common Core and their expectations about college readiness.
Building on this finding, the report highlights a few apparent disconnects between the standards and the college experience. In particular, it notes that high school teachers focus on “source-based writing,” while college instructors “appear to value the ability to generate sound ideas more than some key features of source-based writing.”
Perhaps. Or perhaps college professors value this ability because they assume students have already mastered “source-based writing,” or because (unlike high school teachers) they view their students as mature enough to know a sound idea when they see one. (It’s also worth noting that the survey was of college English teachers; history teachers, for instance, might have felt differently about writing from sources.)
In response to the report, Common Core defenders have pointed to several areas emphasized by the standards (such as “distinguishing fact from opinion”) that are ranked highly by college professors. And some have criticized the report as a disingenuous attack on Common Core.
That may seem harsh, but there is little doubt that the survey results are open to interpretation. Connecting what’s taught in school to what matters later on is obviously a worthy goal. But perhaps, as is so often the case, the real issue is that it is so much easier said than done.
SOURCE: “ACT National Curriculum Survey,” ACT, Inc. (June 2016).
A new study by WestEd researchers looks at the validity of ratings from the Charlotte Danielson Framework for Teaching, a very popular classroom observation instrument often used in teacher evaluation systems.
The study is small in scope, examining the framework’s use in just one district (Nevada’s Washoe County, which we profiled a few years ago for its work in implementing the Common Core). Its purpose was to determine whether the ratings differentiate among teachers, measure distinct areas of teaching practice, and link to teacher effectiveness.
The data cover 713 Washoe elementary, middle, and high school teachers (both tenured and non-tenured) who were observed on all twenty-two components of the Danielson instrument in the 2012–13 school year. The instrument covers four domains: planning and preparation, classroom environment, instruction, and professional responsibilities. Each domain has five or six components that roll up into a single four-point rating for the domain (from ineffective to highly effective).
Key findings: At least 90 percent of teachers were rated effective or highly effective on nearly every one of the twenty-two components, with “effective” the most common rating. In other words, principals use the ratings mainly to distinguish between effective and highly effective teachers and rarely assign the minimally effective or ineffective ratings.
Researchers also found that, within each domain, principals were consistent in their scoring: Teachers who received a high rating for one component tended to receive a high rating for others in that domain as well. Data also showed (via “confirmatory factor analysis”) that the domains are not measuring different aspects of teaching but, rather, a single dimension; thus, they recommend that the district use a single rating.
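As a rough illustration of what “a single dimension” means here, the sketch below uses principal component analysis as a stand-in for the study’s confirmatory factor analysis, on made-up ratings: when the domain scores are highly intercorrelated, one component accounts for most of the variance, which is the kind of pattern behind the researchers’ recommendation of a single overall rating.

```python
# Illustration only (the study used confirmatory factor analysis, not PCA):
# when domain ratings are highly intercorrelated, a single component captures
# most of the variance, i.e., the instrument behaves as one dimension.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_teachers = 713
# Hypothetical data: one underlying "overall quality" score drives four
# noisy domain ratings on a 1-4 scale.
quality = rng.normal(size=n_teachers)
ratings = np.clip(
    np.round(3.0 + 0.5 * quality[:, None] + rng.normal(scale=0.3, size=(n_teachers, 4))),
    1, 4,
)

pca = PCA().fit(ratings)
print(pca.explained_variance_ratio_)  # first component dominates
```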
Finally, for a subset of teachers who teach grades 4–8, correlations showed a statistically significant positive relationship between observation scores and student growth (for students who had attended Nevada schools for at least one previous year). Not many details are provided on these latter analyses—nor are these methods rigorous or longitudinal—but they did reveal a relationship.
The district intends to use the ratings in its state teacher evaluation system to identify areas of needed professional development, as well as to determine performance bonuses and tenure/retention decisions—which is a lot. Determining whether the instrument is valid for these multiple purposes is a good idea, since results here suggest that the lengthy instrument could be made a lot simpler for rating purposes. But because principals don’t distinguish among components, the instrument is not a great tool for identifying professional development needs. And because the outcome analysis is thin, using it to identify teachers for retention, termination, or pay raises does not inspire much confidence, either—at least not based on this one study. Bottom line: Users beware.
SOURCE: Andrea Lash, Loan Tran, and Min Huang, "Examining the validity of ratings from a classroom observation instrument for use in a district’s teacher evaluation system," WestEd (May 2016).