In defense of annual testing
Testing works. Federal intrusiveness and poorly designed interventions are the real problem. Andy Smarick
I’m writing this now in hopes I won’t have to write a future piece that starts: “Alas, a bad idea whose time has come…”
The bad idea is ending annual testing in grades 3–8, which may emerge as a consensus response to concerns about the state of standards, assessments, and accountability.
Clearly, testing is under fire generally. AFT head Randi Weingarten wants to do away with the federal requirement that students take annual assessments. Anti-testing groups are hailing state-based “victories” in rolling back an array of assessments and accountability provisions. Even Secretary Duncan recently expressed misgivings about the amount of time being dedicated to testing.
But the specific idea of returning—regressing—to “grade-span” testing might be gaining steam. Former President Bill Clinton recently said, “I think doing one in elementary school, one in the end of middle school and one before the end of high school is quite enough if you do it right.” At least two bills have been introduced in the House of Representatives to retreat to grade-span testing: One got public support from the NEA, and the other was saluted by the AFT.
What might be even more notable is the lack of vocal defense being mustered for annual testing by long-time advocates for strong accountability. Checker Finn took to National Review Online arguing for an “accountability reboot.”
Among other things, he wrote, “It’s probably time for education reformers and policymakers to admit that just pushing harder on test-driven accountability as the primary tool for changing our creaky old public-school system is apt to yield more backlash than accomplishment.” Among those elements deserving reconsideration are “NCLB-era strategies,” which, of course, include annual assessments that drive school determinations and interventions.
Similarly, annual student testing was conspicuously absent from the statement of accountability principles issued by a group convened by CRPE and Fordham.
Now, to be clear, Duncan, Finn, and the CRPE-Fordham signatories still believe in assessments. In different ways, each noted the invaluable information generated by test results and the importance of using that data to improve student learning. And right now, the leaders of the House and Senate education committees still support annual testing.
I can’t imagine our totally doing away with state tests. But a growing number of reform-oriented leaders may conclude that the cumulative frustration with NCLB, ESEA waivers, Common Core, and common assessments requires a concession. Annual testing could be sacrificed.
Indeed, former Secretary Margaret Spellings recently told Ed Week that unfavorable political conditions, combined with USED’s waiver vacillations, have placed “muscular accountability at risk.” I could also imagine proponents of emerging technologies, individualized learning, and competency-based assessments giving up on end-of-year annual tests as relics of a bygone era.
But before we retreat to the pre-NCLB era of grade-span testing, assess samples of students instead of all students, or revert to some other testing-light position, let’s at least recall some of the benefits of annual testing of all kids.
Now, I’m not saying that people shouldn’t be aggravated by federal intrusiveness or unfitting school classifications and interventions. But these concerns (and others) argue for additional performance metrics, different methods of assessing and classifying schools, and new approaches to school improvement. Jettisoning annual testing is a punishment that doesn’t fit the crime.
Let's continue to critically analyze the consequences of NCLB and ESEA waivers (as folks like Anne Hyslop are doing ably). And let’s explore new ways of executing state governments’ responsibility to ensure high-quality schools for all kids.
But let’s also avoid tossing the baby out with the bathwater. Remember, there’s a reason why today’s era of standards, assessments, and accountability exists: The era that preceded it wasn’t working for entirely too many kids.
Without a doubt, and in the main, testing has done more good than harm in America’s schools. My Fordham colleague Andy Smarick is absolutely correct to argue that annual testing “makes clear that every student matters.” The sunshine created by testing every child, every year has been a splendid disinfectant. There can be no reasonable doubt that testing has created momentum for positive change—particularly in schools that serve our neediest and most neglected children.
But it’s long past time to acknowledge that reading tests—especially tests with stakes for individual teachers attached to them—do more harm than good. A good test or accountability scheme encourages good instructional practice. Reading tests do the opposite. They encourage poor practice, waste instructional time, and materially damage reading achievement, especially for our most vulnerable children. Here’s why:
A test can tell you whether a student has learned to add unlike fractions, can determine the hypotenuse of a right triangle, or understands the causes of the Civil War—and, by reasonable extension, whether I did a good or poor job as a teacher imparting those skills and content. But reading comprehension is not a skill or a body of content that can be taught. The annual reading tests we administer to children through eighth grade are de facto tests of background knowledge and vocabulary. Moreover, they are not “instructionally sensitive.” Success or failure can have little to do with what is taught.
A substantial body of research has consistently shown that reading comprehension relies on the reader knowing at least something about the topic he or she is reading about (and sometimes quite a lot). The effects of prior knowledge can be profound: Students who are ostensibly “poor” readers can suddenly comprehend quite well when reading about a subject they know a lot about—even outperforming “good” readers who lack background knowledge the “poor” readers possess.
Reading tests, however, treat reading comprehension as a broad, generalized skill. To be clear: Decoding, the knowledge of letter-sound relationships that enables you to correctly pronounce written words, is a skill. This is why early instruction in phonics is important. But reading comprehension, the ability to make meaning from decoded words, is far more complex. It’s not a skill at all, yet we test it like one, and in doing so we compel teachers to teach it like one. When we do, students lose.
Even our best schools serving low-income children—public, parochial, and charter alike—have a much harder time raising ELA (English language arts) scores than math. This is unsurprising. Math is school-based and hierarchical (there’s a logical progression of content to be taught). But reading comprehension is cumulative. The sum of your experiences, interests, and knowledge, both in and out of school, contributes to your ability to read with understanding. This is why affluent children who enjoy the benefit of educated parents, language-rich homes, and ample opportunities for growth and enrichment come to school primed to do well on reading tests—and why reading scores are hard to move.
Teacher quality plays a role, but note how fourth-grade NAEP math scores have risen over the years while reading scores have remained flat, even though the same teacher usually handles both subjects. This suggests that our teachers, when they know what to teach, are stronger than we think. In math, standards, curriculum, and assessments are closely aligned (there’s no surprising content on math tests). By treating reading as a collection of content-neutral skills, we make reading tests a minefield for both kids and teachers.
The text passages on reading-comprehension tests are randomly chosen, usually divorced from any particular body of knowledge taught in school. New York State’s Common Core-aligned fifth-grade reading test earlier this year, for example, featured passages about BMX bike racing and sailing. The sixth-grade test featured a poem about “pit ponies,” horses and donkeys used in mines to pull carts of ore. Another passage described how loggerhead sea turtles navigate based on Earth’s magnetic field. That sounds more “school-based,” but in the absence of a common curriculum, there’s no telling whether New York sixth-graders learned about sea turtles and Earth’s magnetic field from their sixth-grade teacher, from watching Magic School Bus, or (alas) not at all. Students who had prior knowledge, whether from home, school, a weekend museum trip with their parents, or personal interest, had an advantage. This means the test was not “instructionally sensitive”—teacher input mattered little.
Certainly, test questions are “standards-based.” One question on the sea turtles passage measured students’ ability to determine the “central idea” of the text; another focused on their ability to “cite textual evidence to support analysis of what the text says explicitly as well as inferences drawn from the text” (Standard RI.6.1). Should students fail at this task, here’s the guidance the New York State Education Department offers teachers:
To help students succeed with questions measuring RI.6.1, instruction can focus on building students’ ability to comprehend grade-level complex texts and identifying specific, relevant evidence that supports an analysis of what the text says explicitly as well as inferences drawn from the text.
This is not bad advice, per se, but it’s unlikely to build reading ability. There’s simply no guarantee that practice in identifying specific, relevant evidence that supports inferences drawn from one text or topic will be helpful in another setting. Testing, especially with value-added measures attached, functionally requires teachers to waste precious time on low-yield activities (practicing inferring, finding the main idea, etc.) that would be better spent building knowledge across subjects. We then hold teachers accountable when they follow that advice and fail, as they inevitably must. This is Kafkaesque.
Students who score well on reading tests are those who have a lot of prior knowledge about a wide range of subjects. This is precisely why Common Core calls for (but cannot impose) a curriculum that builds knowledge coherently and sequentially within and across grades. That’s the wellspring of mature reading comprehension—not “skills” like making inferences and finding the main idea that do not transfer from one knowledge domain to another.
As a practical matter, standards don’t drive classroom practice. Tests do. The first—and perhaps only—litmus test for any accountability scheme is, “Does this encourage the classroom practices we seek?” In the case of annual reading tests, with high stakes for kids and teachers, the answer is clearly “no.” Nothing in reading tests—whether as currently conceived or as anticipated under Common Core—encourages schools or teachers to make urgently needed, long-term investments in coherent knowledge building from grade to grade, the kind that drives language proficiency.
What could replace them? Options might include testing reading annually, but eliminating stakes; testing decoding up to grade four; or substituting subject-matter tests to encourage teaching across content areas. The best and most obvious solution would be curriculum-based tests with reading passages based on topics taught in school. But that would require a common curriculum—surely a nonstarter when mere standards in language arts are politically upsetting.
Annual testing “makes clear that the standards associated with every tested grade and subject matter,” Andy writes. Again, I agree wholeheartedly. But reading is not a subject. It’s a verb. It’s long past time to recognize that reading tests don’t measure what we think they do.
Accountability is essential and non-negotiable, and testing works. Just not in reading.
The benefits of live theater, detrimental reading tests, dual language learners, education spending data, and relative school costs.
“The Relative Costs of New York City’s New Small Public High Schools of Choice,” by Robert Bifulco and Rebecca Unterman, MDRC (October 2014).
This new MDRC study examines the relative costs of approximately 200 small New York City public high schools that were created between 2002 and 2008. These schools serve mostly disadvantaged kids and are located in buildings where larger high schools with low levels of achievement had been closed. Earlier and more recent randomized evaluations have found that attending a small school increased graduation rates by roughly 9 percentage points compared to other NYC public high schools. This new study asks how much it cost to achieve that improvement. Analysts use five years of school-expenditure data for roughly 8,500 students who were first-time ninth graders in 2005 and 2006; they represent 84 of the original 123 small schools—the same sample used to estimate effects on five-year graduation rates. First, analysts examine per-pupil operating costs for the small schools compared to all other district high schools (including actual individual teacher salaries) and find that they are higher, likely because small schools can’t take advantage of economies of scale. Yet when they look at the relative cost of the intervention itself, based on its earlier demonstrated impact on graduation rates, they find two things: (1) expenditures during each of the first four years of high school are not statistically different for students in small schools versus those in other city high schools; and (2) expenditures dropped for the small-schools cohorts because fewer of them needed a fifth year of high school (they were more likely to graduate in four years). In fact, fifth-year enrollment rates were one-third higher for the control group. As a result, the analysts estimate that the cost per graduate for those who attended small schools is 16 percent lower than the cost per graduate for those in the control group. And that doesn’t include the income benefits for students who graduate from high school rather than drop out. The positive impacts of small schools continue to roll in; this initiative appears not to be the disaster that many thought it was. Unfortunately, in education, we rarely have the fortitude to allow interventions to play out and observe results over the long term. Perhaps we should be more patient.
SOURCE: Robert Bifulco and Rebecca Unterman, “The Relative Costs of New York City’s New Small Public High Schools of Choice,” MDRC (October 2014).
Frank McCourt, the memoirist and legendary English teacher at New York’s Stuyvesant High School, was once challenged by a student who asked what possible use a particular work of literature would have in his life. “You will read it for the same reason your parents waste their money on your piano lessons,” McCourt replied tartly, “so you won’t be a boring little shite the rest of your life.” Perhaps schools should collect Boring Little Shite (BLS) data and report it alongside AYP and FRPM. Jay Greene seems to be working on it. A data hawk and acerbic defender of school choice and vouchers, Greene might have been voted least likely to give a damn about the arts before his surprising 2013 study linking field trips to art museums to a range of desirable outcomes, including critical thinking and empathy. He’s at it again in the current issue of Education Next with an interesting study on the effects of taking students to see live theater, including improved grasp of the play, vocabulary, empathy, and tolerance. Greene and his co-authors make much of these gains over a control group that only read the plays or saw film versions. But the good effects aren’t entirely surprising. Attention is the first, most important key to learning. It stands to reason that the novel experience of attending a live performance will capture students’ attention in a way that more familiar modes (watching a movie, reading the text) do not. Likewise, repeated exposure to vocabulary, not memorization, is the source of nearly all of our vocabulary growth. No surprise, then, that seeing a play—a kind of high-value read-aloud—cements words in students’ minds. And surely there’s a link between tolerance, empathy, and the flesh-and-blood representation of humanity on stage. The study, however, has some limitations. It measures relatively short-term gains; do the benefits—especially in tolerance and empathy—stick many months later? A more significant issue may be the “relative homogeneity of the students in [the] sample, with most being white and in advanced classes.” Perhaps other researchers will attempt to replicate the findings among low-income kids and across a wider variety of cultural experiences. Most interesting is Greene’s purpose in shifting his attention to the arts in the first place. “Our goal,” he writes, “is to broaden the types of measures that education researchers, and in turn policymakers and practitioners, consider when judging the educational success or failure of schools.” Hopefully that includes ensuring kids don’t become boring little shites.
SOURCE: Jay P. Greene, Collin Hitt, Anne Kraybill, and Cari A. Bogulski, “Learning from Live Theater,” Education Next, Vol. 15, No. 1 (Winter 2015).
2014 marks the first year that minority students are projected to surpass their white counterparts in public school enrollment. And nearly one in four students in American schools speaks a language other than English at home. Currently, these students, categorized as “dual language learners” (DLLs), are shuffled through a four-part “reclassification” process: a screening assessment, English proficiency support services (such as vocabulary interventions), reassessment, and follow-up monitoring. Such models are mandated by the ESEA, so all states comply in one way or another—but the lack of interstate consensus on exactly how to comply has led to a “chaotic” system, says analyst Conor Williams. There are three issues: (1) local control over which of the four currently available English language proficiency assessments is administered; (2) a lack of consensus regarding when a DLL is proficient and ready for mainstream English instruction; and (3) uncertainty about how to prepare educators and create appropriate DLL instruction. Because states fail to coordinate reclassification policies, DLLs, who are more likely than other student subgroups to move from state to state, fall further behind their peers academically or lose their precious bilingualism—an asset schools should be nurturing, not silencing. Williams’ proposed solution? A unified set of standards, much like the Common Core State Standards, that aligns with current research on language acquisition timelines and encourages instruction in both native languages and English. Some states, like Minnesota, are already in the process of revamping their English Language Learner policies. And while successful implementation will take some fine-tuning, it is a conversation that is long overdue.
SOURCES: Conor P. Williams, “Chaos for Dual Language Learners: An Examination of State Policies for Exiting Children from Language Services in the PreK-3rd Grades,” New America Foundation (September 2014); Conor P. Williams, Ph.D., and Colleen Gross Ebinger, “The Learning for English Academic Proficiency and Success Act: Ensuring Faithful and Timely Implementation,” The McKnight Foundation (October 2014).
Trying to understand how education spending is influencing our education priorities is like looking through murky water, notes this report from the Data Quality Campaign: “[I]t is evident something is there, but it is not exactly clear what.” For example, education leaders need to know whether investments in interventions have an impact, whether schools with high numbers of special-needs students are receiving the resources to which they are entitled, and whether dollars spent on teacher development have led to improvements. Without a clear picture of education spending, there is little to inform decision-makers. The report proposes several solutions. First, states should find new ways to make financial data more accurate and transparent for stakeholders. This starts with changes in data collection, including a shift to a common system of financial record-keeping across states. Second, raw financial data should be translated for use in public reports, including information that connects education dollars to outcomes. The report also encourages states to create a forum for district leaders to share best practices and learn from one another. To illustrate DQC’s proposed reforms, consider funding for special-needs students: Districts could use financial data to connect the extra funding provided for these students with the services and equipment they actually receive. If we had this information for each district, we could begin to identify best practices and apply them across the state and beyond. These reforms require a fundamentally different system than the one currently in place, but the change is crucial if we hope to make informed financial decisions that drive results for students. Meanwhile, Fordham’s taken a stab at peering through the murky water of school financial data in the D.C. metropolitan area. The results may surprise you.
SOURCE: “Using Financial Data to Support Student Success,” Data Quality Campaign (October 2014).