NYC Public School Parents: More statistical malpractice from Tweed: Joel Klein and his claims do not measure up

Tuesday, August 5, 2008

Check out Elizabeth Green’s article today in the NY Sun, in which she asked three academics to scrutinize the validity of the administration’s claim of narrowing the achievement gap between ethnic and racial groups.

See, for example, Bloomberg’s recent testimony before Congress, in which he said that “over the past six years, we’ve done everything possible to narrow the achievement gap – and we have. In some cases, we’ve reduced it by half.”

Yet the evidence for this is weak to non-existent. On the national tests called the NAEPs, there has been no narrowing of the achievement gap in any area since the Bloomberg/Klein reforms were instituted:

An analysis by the National Center for Education Statistics, the research arm of the federal Education Department, concludes that no achievement gaps have narrowed at all in New York City between 2003 and 2007. The only gap that moved in any significant direction is the one between poor students and the rest of the population, which widened slightly, that analysis said. The National Center for Education Statistics also concludes that upward trends in the reading scores of black and Hispanic fourth-graders lauded by Mr. Klein are not statistically significant.

In the article, Joel Klein reveals his statistical illiteracy:

“Those are just confidence levels. Nobody is saying this is a science,” Mr. Klein said. He added: “If three points is flat, and four points is statistically significant, then what you're doing is, you're playing something of a game.”

Chief press officer David Cantor called the memo from NCES “a politicized gloss.”

Instead, it is the DOE that insists on playing games and politicizing the issue, by continuing to slander experts as somehow biased when they provide objective evidence that the non-stop PR spin issuing from Tweed has no basis in reality.
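To see why a cutoff between "flat" and "statistically significant" is not a game, here is a minimal sketch of the kind of test involved. It assumes a plain two-sided z-test on two independent sample estimates; the scores and standard errors are hypothetical, chosen only for illustration, and NCES's actual procedure may differ in its details. The function name is my own.

```python
import math

def gain_is_significant(score_before, se_before, score_after, se_after, z_crit=1.96):
    """Two-sided z-test for a change between two independent sample estimates.

    Returns (gain, standard error of the gain, z statistic, significant?).
    z_crit=1.96 corresponds to the conventional 95% confidence level.
    """
    gain = score_after - score_before
    # Sampling errors of independent estimates add in quadrature.
    se_gain = math.sqrt(se_before ** 2 + se_after ** 2)
    z = gain / se_gain
    return gain, se_gain, z, abs(z) > z_crit

# Hypothetical numbers, NOT actual NAEP results: an average score of 200
# with a standard error of 1.4 in both years.
three_point = gain_is_significant(200.0, 1.4, 203.0, 1.4)
four_point = gain_is_significant(200.0, 1.4, 204.0, 1.4)

for label, (gain, se_gain, z, significant) in (("3-point", three_point),
                                               ("4-point", four_point)):
    print(f"{label} gain: se={se_gain:.2f}, z={z:.2f}, significant={significant}")
```

Under these assumed standard errors, a three-point gain cannot be told apart from sampling noise while a four-point gain can. The line falls where it does because of how precisely the scores were measured, not because anyone picked it for convenience.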

This is hardly the first time the DOE has revealed such statistical malpractice. Jim Liebman, law professor and head of the DOE accountability office, is a repeat offender. I recall one episode in particular when Liebman, testifying before the City Council, insisted that he wasn't basing school grades primarily on the results of a few tests, since each test was really "multiple assessments" given out over "multiple days," resulting in "multiple measures" of proficiency.

Another instance was Jennifer Bell-Elwanger, head of testing for the DOE, who gave a PowerPoint presentation to the Panel for Educational Policy in late November, following the release of the NAEP results. She continually pointed out gains that, according to the NCES, were not statistically significant. Patrick Sullivan, Manhattan rep to the PEP and fellow blogger here, who seems to understand data better than anyone currently employed by the DOE, questioned her closely, saying, "But by definition these are insignificant gains, no?" Which she, of course, fervently denied.

Indeed, the memo from the NCES was prepared in response to a highly misleading email that Klein sent to nearly every NYC resident after NAEP results were first reported. In his email, he falsely claimed that the results showed “good progress that is consistent with the overall picture” and said that the NAEP showed a narrowing of the gap in nearly all areas, when the data itself revealed quite the opposite.

According to NYC’s results on the state exams, the situation is more complicated. The achievement gap is narrowing in some areas when one looks at “proficiency” levels, that is, whether a student is at a level 1, 2, 3, etc., but not in terms of the actual scale scores.

Some testing experts consider proficiency levels less meaningful than scale scores, as they can be arbitrary, subjective and easy to manipulate. Daniel Koretz, a professor at Harvard and a national expert on testing, has just published the must-read book of the summer, Measuring Up: What Educational Testing Really Tells Us. Here is what Koretz has to say about proficiency levels:

….the percents deemed proficient [on state tests] are largely unrelated to states’ actual levels of student achievement…are also often inconsistent across grades or among subjects in a grade....[This system] obscures a great deal of information [because] of the coarseness of the resulting scale.
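How a gap can appear to narrow in proficiency levels but not in scale scores can be illustrated with a small sketch. Everything in it is an assumption made for illustration: two hypothetical groups with normally distributed scale scores, an invented cut score, and an identical ten-point gain for both groups. None of these numbers come from the state exams.

```python
from statistics import NormalDist

CUT_SCORE = 620   # hypothetical "proficient" cut score
SD = 40           # hypothetical spread of scale scores in each group

def pct_proficient(mean, sd=SD, cut=CUT_SCORE):
    """Percent of a normally distributed group scoring at or above the cut."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cut))

def gaps(mean_a, mean_b):
    """Return (scale-score gap, percent-proficient gap) between two groups."""
    scale_gap = mean_a - mean_b
    proficiency_gap = pct_proficient(mean_a) - pct_proficient(mean_b)
    return scale_gap, proficiency_gap

# Hypothetical group means before, and after BOTH groups gain 10 points.
scale_gap_before, prof_gap_before = gaps(670, 640)
scale_gap_after, prof_gap_after = gaps(680, 650)

print(f"scale-score gap: {scale_gap_before} -> {scale_gap_after}")
print(f"percent-proficient gap: {prof_gap_before:.1f} -> {prof_gap_after:.1f}")
```

In this made-up case the scale-score gap stays at 30 points, yet the gap in percent proficient shrinks from roughly 20 to roughly 16 percentage points, simply because of where the cut score sits relative to the two groups. Move the cut score and the apparent narrowing changes, which is one reason the measure is easy to manipulate.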

Even more importantly, why should we trust the state scores at all, when we know that the gains that they purport to show are not reflected in more trustworthy measures like the NAEPs?

The best thing about Koretz’s book is his lucid explanation of why “test score inflation” inevitably occurs when you attach high stakes to exams, and how this undermines the integrity and validity of the results; this has increasingly been the case throughout the nation as a result of NCLB, but even more here in NYC, as a result of the increasingly high-stakes policies of the Bloomberg/Klein administration.

Steve Koss has written about this eloquently on our blog, in relation to Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

I have tried to explain this phenomenon to many elected officials, staff, and reporters over the years, apparently with little success. I certainly don’t know a single NYC media outlet that has ever mentioned it, though Campbell’s Law was cited in some recent letters to the NY Times in response to the administration’s experiment to pay students for high scores.

I recall a lengthy discussion of this issue several years back with NY Times reporter David Herszenhorn while he was still on the education beat, to explain my opposition to the Mayor’s newly proposed 3rd grade retention policy. I so vociferously opposed this policy, and still do, not just because it was unfair to the student to base such a life-altering decision on one single, fallible test score, with such large margins of error; and not just because retention has been shown to have a racially disparate impact and to hurt rather than help most low-performing students.

My opposition was also due to the fact that the more significant the consequences attached to any test, the less its results can be trusted as a reliable gauge of real learning.

Since then, of course, the administration has piled on more and more high-stakes consequences for students, teachers, and schools by adding fifth and seventh grade retention, awarding principals, teachers and students monetary rewards for high scores, and threatening to close down schools if scores don’t improve fast enough. The scores themselves have been rendered entirely meaningless as a result, as excessive test prep, teaching to the test, cheating, and other strategies to “game” the system have totally overtaken our schools.

Here is what Koretz has to say about this:

“One might expect that with the huge increase in the amount of testing in recent years, we would know more…Ironically, the reverse is true. While we have far more data now than we did twenty or thirty years ago, we have fewer sources of data that we can trust. The reason is simple: the increase in testing has been accompanied by a dramatic upsurge in the consequences attached to scores. This in turn has created incentives to take shortcuts -- various forms of inappropriate test preparation, including outright cheating -- that can substantially inflate test scores, rendering trends seriously misleading or even meaningless.”

To the administration, all this appears acceptable: as long as test scores go up, their policies are justified. In truth, they don’t seem to care whether the increases are meaningful or not.

Their laissez-faire attitude is revealed by the total lack of interest evinced in following up on even well-documented cases of cheating. (See, for example, this story in the NY Sun; though it says the DOE is “investigating,” this will likely lead nowhere, as such stories have in the past.)

This “anything goes” attitude is also reflected in Klein’s remarks in a recent interview in the NY Post:

Q: What about complaints about the report-card grades for schools?

A: The report cards were probably one of the noisy periods. But . . . I can't tell you how many principals said to me, 'You know, chancellor, I didn't get the right grade but I promise you I won't get the same one next year,' so I think that had a big impact.

Here, Klein implicitly acknowledged that even while principals do not accept the fairness of these grades, based primarily on one-year gains in scores, he is content as long as they guarantee that they will get these scores to rise in the future, by any means possible.