Evaluating the Content and Quality of Next Generation Assessments: A Preview
By Amber M. Northern and Michael J. Petrilli
The Thomas B. Fordham Institute has been evaluating the quality of state academic standards for nearly twenty years. Our very first study, published in the summer of 1997, was an appraisal of state English standards by Sandra Stotsky. Over the last two decades, we’ve regularly reviewed and reported on the quality of state K–12 standards for mathematics, science, U.S. history, world history, English language arts, and geography, as well as the Common Core, International Baccalaureate, Advanced Placement, and other influential standards and frameworks (such as those used by PISA, TIMSS, and NAEP). In fact, evaluating academic standards is probably what we’re best known for.
For most of the last two decades, we’ve also dreamed of evaluating the tests linked to those standards—mindful, of course, that in most places, the tests are the real standards. They’re what schools (and sometimes teachers and students) are held accountable for, and they tend to drive curricula and instruction. (That’s probably the reason why we and other analysts have never been able to demonstrate a close relationship between the quality of standards per se and changes in student achievement.) We wanted to know how well matched the assessments were to the standards, whether they were of high quality, and what type of cognitive demands they placed on students.
But with fifty-one different sets of tests, such an evaluation was out of reach—particularly since any bona fide evaluation of assessments must get under the hood (and behind the curtain) to look at a sizable chunk of actual test items. Getting dozens of states—and their test vendors—to allow us to take a peek was nigh impossible.
So when the opportunity came along to conduct a groundbreaking evaluation of Common Core-aligned tests, we were captivated. We were daunted, too—both by the enormity of the task and by the knowledge that our advocacy of the standards would likely cause any number of doubters and critics to sneer at such an evaluation coming from us, regardless of its quality or impartiality.
So let’s address that first. It’s true that we continue to believe that children in most states are better off with the Common Core standards than without them. If you don’t care for the standards (or even the concept of “common” standards), or perhaps you come from a state that never adopted these standards or has since repudiated them, you can probably ignore this study. Our purpose is not to re-litigate the Common Core debate. Rather, we want to know, for states that are sticking with the common standards, whether the “next-generation assessments” that have been developed to accompany the standards deliver what they promised in terms of strong content, quality, and rigor.
No single study can come close to evaluating all of the products in use and under development in today’s busy and fluid testing marketplace. But we were able to provide an in-depth appraisal of the content and quality of three “next-generation” assessments—ACT Aspire, PARCC, and Smarter Balanced—and one best-in-class state test, the Massachusetts Comprehensive Assessment System (MCAS, 2014). In total, over thirteen million children (about 40 percent of the country’s students in grades 3–11) took one of these four tests in spring 2015. Of course it would be good to encompass even more. Nevertheless, this study ranks as possibly the most complex and ambitious single project ever undertaken by Fordham.
After we agreed to myriad terms and conditions, we and our team of nearly forty reviewers were granted secure access to operational items and test forms for grades five and eight (the elementary and middle school capstone grades that are the study’s focus).
This was an achievement in its own right. It’s no small thing to receive access to examine operational test forms. This is especially true in a divisive political climate where anti-testing advocates are looking for any reason to throw the baby out with the bathwater—and where market pressure gives test developers ample reason to be wary of leaks, spies, and competitors. Each of the four testing programs is to be commended for allowing this external scrutiny of its “live” tests, which cost so much by way of blood, sweat, tears, and cash to develop and bring to market. They could have easily said, “Thanks, but no thanks.” But they didn’t, and for that, we’re grateful. Educators, policy makers, and taxpayers also owe each test developer a debt of thanks for their commitment to transparency and public accountability, which is essential to public confidence in assessments whose results hold such outsized importance for K–12 education.
Part of the reason they agreed was the care we took in recruiting smart, respected individuals to help with this project. Our two lead investigators, Nancy Doorey and Morgan Polikoff, bring a wealth of experience in educational assessment and policy, test alignment, academic standards, and accountability. Nancy has authored reports for several national organizations on advances in educational assessment and copiloted the Center for K–12 Assessment and Performance Management at ETS. Morgan is an assistant professor of education at the University of Southern California and a well-regarded analyst of the implementation of college and career readiness standards. He is an associate editor of the American Educational Research Journal, serves on the editorial board for Educational Administration Quarterly, and is the top finisher in the RHSU 2015 Edu-Scholar rankings for junior faculty.
Nancy and Morgan were joined by two well-respected content experts who facilitated and reviewed the work of the ELA/Literacy and math review panels. Charles Perfetti, a distinguished university professor of psychology at the University of Pittsburgh, served as the ELA/Literacy content lead; Roger Howe, a professor of mathematics at Yale, served as the math content lead.
Given the importance and sensitivity of the task at hand, we spent months recruiting and vetting the individuals who would eventually make up the panels led by Dr. Perfetti and Dr. Howe. We began by soliciting recommendations from each testing program and other sources (including content and assessment experts, individuals with experience in prior alignment studies, and several national and state organizations). In the end, we recruited at least one reviewer recommended by each testing program to serve on each panel; this strategy helped to ensure fairness by equally balancing reviewer familiarity with the various assessments.
So how did our meticulously assembled panels go about evaluating the tests? The long version will be available tomorrow, including ample detail about the study design, testing programs, criteria, selection of test forms, and review procedures.
But the short version is this: We deployed a brand-new methodology developed by the Center for Assessment to evaluate the four tests—a methodology that was itself based on the Council of Chief State School Officers’ 2014 “Criteria for Procuring and Evaluating High-Quality Assessments.” Those criteria, say their authors, are “intended to be a useful resource for any state procuring and/or evaluating assessments aligned to their college and career readiness standards.” This includes, of course, tests meant to accompany the Common Core standards.
The CCSSO Criteria address the “content” and “depth” of state tests in both English language arts and mathematics. For ELA, the “content” criteria cover topics such as whether students are required to use evidence from texts; for math, they ask whether the assessments focus strongly on the content most needed for success in later mathematics. The “depth” criteria for both subjects include whether the tests require a range of “cognitively demanding,” high-quality items that make use of various item types (e.g., multiple choice, constructed response), among other things.
The Center for Assessment took these criteria and transformed them into a number of measurable elements that reviewers addressed. In the end, the newly minted methodology wasn’t perfect. Our rock star reviewers improved upon it and wanted others following in their footsteps to benefit from what they learned, so we made adjustments along the way.
The panels essentially evaluated how well each assessment matched the key elements of the CCSSO document, assigning one of four “match” ratings to each ELA- and math-specific criterion: Excellent, Good, Limited/Uneven, or Weak. To generate these marks, each panel reviewed the ratings from the grade-five and grade-eight test forms, considered the results of the analysis of the program’s documentation (which preceded the item review), and came to consensus on the rating.
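For readers who want to see the mechanics, here is a minimal sketch, in Python, of how a four-point match scale like this one could be represented and rolled up into a single rating. The panels themselves reached consensus through discussion; the MatchRating enum and the low-median roll-up rule below are purely illustrative assumptions, not the study’s actual procedure.

```python
from enum import IntEnum
from statistics import median_low

# The four "match" ratings used in the review, ordered from weakest to strongest.
class MatchRating(IntEnum):
    WEAK = 1
    LIMITED_UNEVEN = 2
    GOOD = 3
    EXCELLENT = 4

def roll_up(ratings: list[MatchRating]) -> MatchRating:
    """Hypothetical roll-up rule: take the low median of the individual ratings,
    so a single generous outlier cannot pull the overall rating upward.
    (The actual panels reached consensus through discussion, not a formula.)"""
    if not ratings:
        raise ValueError("at least one rating is required")
    return MatchRating(median_low(sorted(ratings)))

# Example: grade-5 form rated Good, grade-8 form rated Excellent,
# documentation review rated Good -> overall rating of Good.
print(roll_up([MatchRating.GOOD, MatchRating.EXCELLENT, MatchRating.GOOD]))
```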
***
We at Fordham don’t plan to stay in the test-evaluation business. The challenge of doing this well is simply too overwhelming for a small think tank like ours. But we sincerely hope that others will pick up the baton, learn from our experience, and provide independent evaluations of the assessments in use in the states that have moved away from PARCC, Smarter Balanced, or ACT Aspire.
Not only will such reviews provide critical information for state and local policy makers, educators, curriculum developers, and others; they might also deter the Department of Education from pursuing a dubious plan to make states put their new assessments through a federal evaluation system. In October 2015, the department issued procedures for the “peer review” process that had been on hold for the last three years. The guidelines specify that states must produce evidence that they “used sound procedures in [the] design and development [of] state tests aligned to academic standards, and for test administration and security.” Count us among those who think that renewed federal vetting of state tests invites unwanted meddling from Uncle Sam (and could spark another round of backlash akin to what befell the Common Core itself a few years back). Besides, twelve years during which the department already had such guidance in place did little to improve the quality of state tests—hence the recent moves to improve them.
***
Now you’ve got the background. Come back to our website on Thursday morning for the results. Trust us: It will be worth the wait.
New York State education officials raised a ruckus two weeks ago when they announced that annual statewide reading and math tests, administered in grades 3–8, would no longer be timed. The New York Post quickly blasted the move as “lunacy” in an editorial. “Nowhere in the world do standardized exams come without time limits,” the paper thundered. “Without time limits, they’re a far less accurate measure.” Eva S. Moskowitz, founder of the Success Academy charter schools, had a similar reaction. “I don’t even know how you administer a test like that,” she told the New York Times.
I’ll confess that my initial reaction was not very different. Intuitively, testing conditions would seem to have a direct impact on validity. If you test Usain Bolt and me on our ability to run one hundred meters, I might finish faster if I’m on flat ground and the world record holder is forced to run up a very steep incline. But that doesn’t make me Usain Bolt’s equal. By abolishing time limits, it seemed New York was seeking to game the results, giving every student a “special education accommodation” with extended time for testing.
But after reading the research and talking to leading psychometricians, I’ve concluded that both the Post and I had it wrong—untimed tests are not less accurate. While there’s not a deep body of research on timed versus untimed tests, the studies that do exist indicate that for non-learning-disabled students, extra time does not significantly alter outcomes. Students with learning disabilities have been found to perform significantly better under extended time conditions than they do under fixed time conditions, which is why special education students often get extra time on exams. But for students without disabilities, no significant differences have been found.
Broadly speaking, there are two types of tests: speed and power tests. In a speed test—like a timed multiplication drill—the method of answering the questions (multiply) is clear and obvious. The test seeks to determine how many questions you can answer correctly in the allotted time. A power test, on the other hand, presents students with a smaller number of more complex questions. When what’s being tested is your ability to figure out how to answer the questions, your speed and the time allotted don’t matter as much.
In a “speed” test, Usain Bolt would kick my butt. We’d likely score the same on a hundred-meter “power” test, but running ability would not be the issue. We could walk, skip, or turn cartwheels, since our ability to cover the distance is what’s being measured, not how quickly we do it.
Most state tests are power tests, notes professor Andrew Porter, the former dean of the University of Pennsylvania's Graduate School of Education and a past president of the American Educational Research Association. They are designed so that nearly all students will be able to complete all items within the allotted time. Thus, there’s no reason to expect any difference in performance if time limits are dropped. “Some students, if given a lot of time, will take a lot of time,” Porter notes. “It doesn’t mean they’re going to do any better.”
Should other states follow New York’s lead? For many, the point is moot. Eighteen states administer untimed, computer-based tests from the Smarter Balanced Assessment Consortium (SBAC). These “adaptive” tests don’t interpret speed as significant. If two test takers are presented with the same item and provide the same correct answer, but one does it twice as fast, both will still get the same next item.
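To make that point concrete, here is a minimal sketch, assuming a simple difficulty-stepping rule (not SBAC’s actual adaptive algorithm), of why response time doesn’t affect what a test taker sees next: the item-selection function simply has no time parameter.

```python
# Illustrative only: a toy adaptive rule that steps item difficulty up after a
# correct answer and down after an incorrect one. Elapsed time never enters in.

def next_item_difficulty(current_difficulty: float, answered_correctly: bool,
                         step: float = 0.5) -> float:
    """Pick the difficulty of the next item based solely on correctness."""
    return current_difficulty + step if answered_correctly else current_difficulty - step

# Two test takers answer the same item correctly, one in thirty seconds and one
# in sixty. Both are routed to an item of the same difficulty.
fast_taker = next_item_difficulty(0.0, answered_correctly=True)
slow_taker = next_item_difficulty(0.0, answered_correctly=True)
assert fast_taker == slow_taker == 0.5
```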
If lifting time limits doesn’t change results or validity, why bother? Consider it a power test of the “opt-out” movement—and one New York might fail. Last year, the number of students declining to take state exams in New York quadrupled to 20 percent of all those who were supposed to be tested, according to data from the state’s education department. That makes New York one of the biggest opt-out states in the nation. Education officials seem to be gambling that allowing unlimited time will give parents one less reason to complain about “test pressure.”
My gut tells me that even if eliminating time limits blunts a bit of opt-out grumbling for now, the real source of test pressure is not the clock—it’s adults pressuring kids to perform. Any leveling or reduction in the number of parents refusing to let kids sit for state tests this year will likely be a function of New York’s moratorium on linking test scores to teacher evaluations. With less at stake for now, school administrators and teachers are less likely to transfer their anxieties to students, wittingly or unwittingly. But look for the pressure to return with a vengeance when the moratorium ends in 2020.
Ironically, the move to end time limits could backfire. Not because it will alter the validity of the tests, but simply because it’s a waste of time and money. “I would never, ever give a test that I didn’t put a time limit on,” Porter says. “In most states across the country, there’s a move to decrease the amount of time spent testing. This moves in the opposite direction.”
Note: A shorter version of this piece originally ran in the New York Daily News.
In this week's podcast, Mike Petrilli and Robert Pondiscio preview Fordham’s long-awaited assessments evaluation, analyze low-income families’ education-related tech purchases, and wave the red flag about TFA’s lurch to the Left. In the Research Minute, David Griffith examines how well the nation’s largest school districts promote parent choice and competition between schools.
Grover J. (Russ) Whitehurst, “Education Choice and Competition Index 2015,” Brookings Institution (February 2016).
For the past few years, Russ Whitehurst of the Brookings Institution has ranked the nation’s hundred largest school districts based on the amount of school choice they give to families and the degree to which they promote competition between schools. In many ways, these rankings are similar to Fordham’s own choice-friendly cities list, though the unit of analysis and metric differ somewhat. As in prior years, five of Brookings’s thirteen indicators concern the availability, accessibility, comparability, clarity, and relevance of information about school performance—a far heavier emphasis than one finds in Fordham’s metric. The other eight indicators deal with topics such as school closure, transportation, and the existence of a common application for district schools, several of which are common to both reports.
Though not one of the nation’s largest districts, the Recovery School District in New Orleans is again included in the Brookings rankings because of its unique status within the school choice movement. Once again, it ranks first overall. Yet in the report accompanying this year’s rankings, Whitehurst argues that because of its unique circumstances, New Orleans isn’t a realistic model for other districts. He points instead to Denver, now listed second overall and first among large school districts. (New Orleans and Denver finished first and third respectively in Fordham’s study.)
Other districts Brookings rates highly include New York, Newark, D.C., Houston, Boston, and Baltimore—all of which (with the exception of D.C.) finished lower in Fordham’s overall rankings, while Milwaukee and Indianapolis finished higher. These differences reflect a number of factors, one of which is the comparatively greater emphasis on public versus private school choice in Brookings’ report.
Indeed, the report’s treatment of public school choice—and open enrollment systems in particular—is admirably sophisticated, insightful, and politically astute. In particular, it argues that “the cities that are closest to having a system that supports full and equitable open enrollment are exposing the limitations of a design perspective that prioritizes abstract features such as fairness, efficiency, stability, and universality to the exclusion of factors that are high priorities for many parents.” The result is a system that seems “very good from an intellectual perspective but creates undesirable levels of dissatisfaction among its users.”
This is spot-on, as far as I’m concerned, as is the proposed solution—a design perspective that applies the insights of behavioral economics to school choice by “constraining the initial menu of choices and nudging the shopper towards alternatives that evidence suggests should be preferred.” For example, Boston parents have the option of “rejecting the pre-populated list [of schools] and choosing instead from a much larger list of options.” Fantastic. Let’s do that everywhere.
Similarly, the report rightly acknowledges that eliminating neighborhood preference is politically impossible in some districts due to the psychology of “loss-aversion” (parents prefer certainty to uncertainty because they worry about bad outcomes more than they care about good ones) and the very real losses incurred by richer families. Again, the report offers a sensible solution: Parents should opt out of lotteries rather than opt into them, thereby encouraging choice without requiring it. Only the purest libertarian could object to such a strategy.
Ultimately, the Brookings and Fordham reports complement each other, despite their differences, and their common message can be summed up in one word: progress.
SOURCE: Grover J. (Russ) Whitehurst, “Education Choice and Competition Index 2015: Summary and Commentary,” Brookings Institution (February 2016).
Detroit Public Schools recently made national headlines for the heartbreaking conditions of its school facilities and a widespread teacher “sick-out.” For Detroit, these are sadly just the latest hurdles to overcome: The public school system has been in dire financial straits for many years, while national testing data indicate that the district’s students are among the lowest-achieving in the nation.
A report from the Lincoln Institute, a nonprofit that focuses on land use and tax policy, provides a fascinating angle on the Detroit situation. It highlights the massive problems that the Motor City encounters when trying to finance public services, including education, through its local property tax system. Consider just a few bleak statistics reported in this paper: 1) the property tax delinquency rate was a staggering 54 percent in 2014; 2) roughly eighty thousand housing units, 23 percent of Detroit’s housing stock, sat vacant; and 3) 36 percent of commercial property and 22 percent of industrial property sat vacant.
The report also highlights ways that property tax policies exacerbate the school system’s revenue woes. First, property tax abatements—tax breaks aimed at spurring re-investment—have reduced or exempted the tax liabilities of more than ten thousand properties. Whether the benefit of these reductions outweighs the cost to public services is hotly debated, but the fact remains that schools lose potential funding. Second, the report spotlights the significant number of tax-exempt parcels in Detroit (owned by public or nonprofit agencies), an issue that is being further aggravated as foreclosed properties transfer to public ownership. Third, property assessment policy (how a property’s value is determined) has apparently been a disaster. As reported by the media, widespread problems in assessing property values in 2013 led public officials to retract the valuations and start over.
What lessons do we learn from the Detroit debacle? The reliance on property taxes can put schools at risk when local conditions deteriorate. State policy makers need to be careful to ensure that students aren’t lost when the tax base collapses. Meanwhile, though not the topic of the Lincoln Institute paper, schools must also do their part when facts on the ground worsen. As difficult as it may be, this means addressing costs, including reductions in force (ideally, dismissing the lowest-performing teachers first) or selling underutilized property. Thankfully, most states’ urban school systems aren’t in the deplorable condition of Detroit; yet they should absolutely be on guard and learn how not to go broke.
SOURCE: Gary Sands and Mark Skidmore, “Detroit and the Property Tax: Strategies to Improve Equity and Enhance Revenue,” Lincoln Institute of Land Policy (November 2015).