Mike opened the door for my response to the Washington Opportunity Scholarship Program external evaluation, and I've just completed a fairly quick read of it. First, in the spirit of full disclosure, I'll note that my former employer, Westat, was the prime contractor for the evaluation. Though I never personally worked with the Westat staff who conducted the evaluation, I do know their reputations for quality work. That's not the only reason, of course, that I found the evaluation to be of high quality, but it's worth mentioning. Disclosure aside, I have a couple of takeaways from the evaluation.
First, the impact findings for the program are simply not that compelling (sorry, Mike), and even the subgroup analyses--which do provide a ray of hope--are presented with important caveats. The evaluation used a randomized controlled trial design in which eligible applicants were randomly assigned to receive or not receive the scholarship. By all accounts, the sample was drawn appropriately and is of sufficient size (n=2,308, which, we're told, is larger than the impact samples in previous, similar evaluations); furthermore, the analyses appear thoughtfully and meticulously conducted.
So, while I have few qualms with the evaluation design itself, I do think something that occurred naturally within the impact sample--namely, lots of student mobility--is worth keeping in mind. Over the course of two years, only 4 percent of the treatment group remained in the same school they attended when they applied to the program; 71 percent switched schools once, and 25 percent switched schools twice. Among the control group, 22 percent remained in the same school, 57 percent switched schools once, and 21 percent switched schools twice. That's a majority of kids (even more so in the treatment group) not attending any one participating school for very long. The authors report that "both groups experienced higher rates of school mobility than the typical annual rate for urban students (22 to 28 percent)." It's not surprising, then, to see unimpressive findings in an evaluation that covers such a short duration (two years) and examines achievement data from students who are extremely transient (not to mention that students were tested on Saturdays!).
Second, I'm struck by the number of times that the phrase "adjustments for multiple comparisons suggest that this finding may be a false discovery" (or similar language) appears in the report. Researchers worry about multiple comparisons because they are simultaneously evaluating many questions and hypotheses. Simply put, when you run many separate statistical tests and read each result on its own, the chance that at least one apparent effect is nothing more than chance grows with the number of tests. The issue has gotten more attention of late, in part because of this recent report from IES, which presents methods for dealing with the multiple comparisons problem. Like most people involved with education, I'm interested in the best research possible given the time and resources available to conduct it. Many statisticians believe that ignoring the multiplicity problem leads to misinterpretation of findings, so these researchers covered their bases.
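To make that arithmetic concrete, here's a small back-of-the-envelope sketch, with made-up numbers rather than anything from the OSP report, of why running many tests inflates the chance of a spurious "finding," plus one common correction, the Benjamini-Hochberg false discovery rate procedure. The report's authors describe their own adjustment methods; this is just an illustration of the general idea.

```python
# Toy illustration of the multiple comparisons problem.
# All numbers below are hypothetical, not taken from the OSP evaluation.

alpha = 0.05

# If no real effects exist, the chance of at least one false positive when
# running m independent tests, each at the 0.05 level:
for m in (1, 5, 10, 20):
    print(m, round(1 - (1 - alpha) ** m, 2))
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of p-values declared discoveries at FDR level q."""
    m = len(p_values)
    ranked = sorted(enumerate(p_values), key=lambda pair: pair[1])
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * q:  # BH step-up threshold
            cutoff_rank = rank
    return sorted(idx for idx, _ in ranked[:cutoff_rank])

# Made-up p-values from six separate subgroup tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_vals))  # [0, 1]: only the two smallest survive
```

With twenty independent tests, each at the conventional 0.05 level, the odds of at least one false positive are roughly two in three, and results that look significant on their own (like the 0.039 and 0.041 above) can fail to survive adjustment. That's exactly why those "may be a false discovery" caveats pile up in a report with many subgroup analyses.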
But with all of those "false discovery" caveats in the report, I found myself harking back to Judith Gueron's comments in this book. Gueron (of the Manpower Demonstration Research Corporation, or MDRC) writes:
Finally, rigor has its drawbacks. Peter Rossi once formulated several laws about policy research, one of which was: the better the study, the smaller the likely impact. High quality policy research must continuously compete with the claims of greater success based on weaker evidence.
Ahh, so true. Sooner or later, we must come to terms with the fact that the bar we set for rigor may unintentionally and preemptively knock out of the running a program that may, in fact, make some improvement in American education. Mind you, I'm not calling for a return to the days when anecdote passed for education research. Here's Gueron again on a lesson she learned about running successful social experiments:
You do not need dramatic results to have an impact on policy. Many people have said that the 1988 welfare reform law, the Family Support Act, was based and passed on the strength of research--and the research was about modest changes. When we have reliable results, it usually suggests that social programs (at least the relatively modest ones tested in this country) are not panaceas but that they nonetheless can make improvements. One of the lessons I draw from our experience is that modest changes have often been enough to make a program cost-effective and can also be enough to persuade policymakers to act. However, while this was true in the mid 1980's, it was certainly not true in the mid 1990's. In the last round of federal welfare reform, modest improvements were often cast as failures.
The question is: Will the OSP ultimately pass the "modest improvement" test? At two years--a time period that's too short to capture impacts that may evolve over time--we don't know. What I do know is that parents believe the OSP is making improvements, that improvement for certain groups of students may exist, and that school choice in and of itself may prove a laudable goal even without raise-the-roof achievement gains. Also, as an educational community, we'd be wise to continue the dialogue around the financial, political, methodological, and common-sensical (I think that's a word) tradeoffs involved in rigorous research.