Thursday, June 9, 2016

Fred Smith on questions raised by the release of some test items from the 2016 3-8th grade state exams

I've looked at the information SED (Commissioner Elia) set forth last week on EngageNY.  It consists of a grade by grade release of 75% of the reading passages and multiple-choice items that appeared on the April ELA exams--by way of Pearson to Questar.  All of the material and questions that required constructed responses have been provided, as well. I did not look at the math test.

It is true that the amount of operational test material and the number of items disclosed is more than was given out in each of the prior three years of Pearson's core-aligned testing.  And since 2012, this is the earliest this has happened.  [Note: When CTB/McGraw-Hill was the test publisher during the NCLB years, the complete test was accessible to the public on SED's web site within weeks of its administration, along with answer keys. Item analysis data followed shortly thereafter.]

Upon review of the just-released spring 2016 testing output, however, certain useful data have not been made available. SED has been moved to offer us a translucent view of the exams, but it still is not being entirely transparent.

In order to make the SED information more accessible to reviewers, I re-cast it in the attached Excel workbook. It shows the name of the reading passages that were released, the type of item involved (M-C or CR), the sequence number of each item and the word count of each passage.

In addition, there are four separate measures of readability (also referred to by Pearson/SED as the "complexity metrics"). Besides these numerical indexes, there is a column called Qualitative Review, where the appropriateness of the material is judged.  The outcome of SED's review process was that all 55 ELA reading passages were deemed to be appropriate. The keyed correct answer to each M-C item is presented at the end.

 I have four misgivings about SED's presentation:

1- Twenty-five percent (25%) of the passages and multiple-choice items that counted on the tests remain unreleased.  This amounts to six reading selections and a combined 40 items that appeared in Book 1 on the operational exams given on April 5th.  Why should that be the case in light of the Commissioner's goal of providing more information?  Non-disclosure raises questions about the quality of the material being withheld. If you remember the contract, the 2016 operational material came from Pearson's item bank.  Since this is supposedly Pearson's last year and since NYS owns the material, why hold back?

As a consumer issue, how can SED justify that we, taxpayers, purchased a product we cannot see?  [And now that Pearson is on the way out, what about releasing the mounds of 2013-2015 test content that no one has been allowed to talk about?  Or did Pearson’s 5-year contract concede ownership of the still hidden material to the vendor?]

2- Item statistics have not been made available.  This kind of overarching data based on how the test population performed on each item is useful to researchers, analysts and anyone interested in seeing how the items functioned.  The items are the bricks that go into constructing the exams and on whose strength and quality decisions about children, teachers and schools have come to depend. 

Prior to Pearson, CTB/McGraw-Hill posted the item-level analytic data within months of test administration.  This included item difficulties (p-values, i.e., the percentage of children choosing the correct answer) on multiple-choice items and the average score on constructed-response questions.  CTB also showed how students responded to each distractor, i.e., the proportion of students choosing each wrong answer. Such data provide insights into possible weaknesses in the items (ambiguous choices, more than one best answer, distractors that are non-functional).  A correlation known as the item discrimination index was also provided, showing the relation between students' performance on an item and their performance on the entire test. The expectation is that students who do well on an item also do well on the test.
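To make these three statistics concrete, here is a minimal Python sketch of a classic item analysis. It is an illustration only, not CTB's or SED's actual procedure: the function names, the data layout and the five made-up students are all assumptions, and the discrimination index is computed as a plain Pearson correlation between the item score (1 or 0) and the total test score.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_analysis(responses, key):
    """responses: one list of chosen options per student.
    key: the keyed correct option for each item (hypothetical layout)."""
    # Score every response as 1 (correct) or 0 (incorrect).
    scored = [[int(choice == k) for choice, k in zip(student, key)]
              for student in responses]
    totals = [sum(row) for row in scored]  # each student's total test score
    n = len(responses)
    report = []
    for i in range(len(key)):
        item_scores = [row[i] for row in scored]
        counts = Counter(student[i] for student in responses)
        report.append({
            # Item difficulty: share of students choosing the keyed answer.
            "p_value": sum(item_scores) / n,
            # Share of students choosing each option, distractors included.
            "option_shares": {opt: c / n for opt, c in sorted(counts.items())},
            # Item discrimination: item score vs. total score.
            "discrimination": pearson(item_scores, totals),
        })
    return report


# Made-up answers from five hypothetical students on a three-item test.
toy_responses = [
    ["B", "A", "C"],
    ["B", "A", "D"],
    ["A", "A", "C"],
    ["B", "C", "C"],
    ["D", "B", "A"],
]
toy_key = ["B", "A", "C"]
toy_report = item_analysis(toy_responses, toy_key)
```

In this toy data, three of the five students chose the keyed answer on the first item, so its p-value is 0.6, and the option shares show how the remaining two students split between the distractors. With real statewide response data, the same three numbers per item are what would let outside reviewers judge how an item functioned.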

This full set of statistics—referred to as classic item analysis data—ceased being presented after Pearson won the testing contract.  Since then, only some statistics have been provided—and more than a year after the operational tests have been given—ensuring that the exams could not undergo scrutiny until after Pearson’s next round of testing had taken place.

In the absence of empirical data, a vacuum that SED created, the department has been able to blunt criticisms of the exams. At first it dismissed them as anecdotal. Then, when the complaints became widespread, it provided partial data (p-values and discrimination indexes), but only well after it had complete information readily on hand, keeping facts from those who might otherwise have used them to challenge the exams. [Aside: I remember when a member of Governor Cuomo’s Task Force attempted to stifle Lisa Rudley’s critique that the Common Core Standards and core-aligned exams were being advanced without sound research data to prove their efficacy.  He pointedly asked where her evidence was to support complaints about flaws in the exams.  Given SED’s reluctance to dispense information, this was a preposterous question.]

The demand for complete timely data is not academic or trivial.  Let’s look at just-released Item# 37 from the Grade 6 ELA passage Weed Wars.  You may recall that Leonie Haimson came upon and brought to light information that one passage contained the confusing concept of “impossibly improbable”.  

Here now, from SED’s rush to divulge 2016 information, are the statement in question and the multiple-choice options 11-year-olds had to choose from in answering.

Once in a while, changes to a weed’s DNA would allow that weed to survive the glyphosate.  The chances of changes like this were very, very small.  But when farmers used glyphosate year after year on millions of hectares of crops, “what seems almost impossibly improbable becomes more probable,” Duke says.

37.   What is the meaning of the phrase “impossibly improbable” as it is used in lines 21 through 23?

         A   usually certain

         B   highly unlikely

         C   extremely slow

         D   rarely noteworthy

This item won’t reach the game-changing level of ridicule that Leonie’s exposure of the Pineapple and the Hare did in 2012.  But it underscores the value of having statistics available to evaluate items.  What percentage of kids chose the correct answer (B)?  How did the distractors work?  That is, what proportion of children chose each of the wrong answers?  Having analytic data would enable us to see how this dubious item played out. 

According to the Learning Standards, #37 is coded RI.6.4, which means it was classified as a Reading for Information item to see whether sixth graders can “determine the meaning of words and phrases as they are used in a text, including figurative, connotative and technical meanings.”

My guess is that most kids got the item right, making this an “easy” item, because the distractors seem implausible. I would mark it as a poor item for two reasons: It can be answered correctly without reading the passage from which it is drawn; and the distractors likely didn’t carry much weight. Of course, my hunch may be off.  Perhaps many children chose A in response to this convoluted question. We shouldn’t be left to speculate, however, in the absence of data that SED has in its possession.  Note: SED has the statistics sought virtually as soon as the tests are scored; otherwise it couldn’t have issued the instructional reports it just distributed, as referenced in Elia’s June 2016 letter to colleagues.

3- SED took away information it provided from 2013 – 2015 when it released questions with annotations.  The information was posted in August of those years on EngageNY. In this year’s zeal to reveal more items sooner, SED has not presented the statewide p-values for the items as it had over the last three years. Significantly, the annotations, which SED described as teaching tools, are also gone.  So, thus far this year SED has offered more items but has not included a rationale “to demonstrate why any of the released questions measures the intended standards; why the correct answer is correct; and why each wrong answer is plausible but incorrect.”  It is helpful to gain SED’s perspective about the material and its defense of the correct answer choice.  Unfortunately these explanations have not come out.  I think we should campaign to have SED’s annotations for the 2016 material released immediately.

4- SED failed to follow its own decision-making rules regarding which reading passages were appropriate to include on the operational exams.  Four ways to estimate the readability of potential passages were used in constructing the ELA tests: the Lexile Framework, Flesch-Kincaid, the Degrees of Reading Power and the Reading Maturity Metric (a Pearson measure). Each involves a scale that can be applied to reading material and sets forth a range that is appropriate for each grade.  For example, the Lexile Framework indicates that reading material ranging from 740L – 1010L is appropriate for 4th and 5th graders.  Ergo, material outside that band may not be right for children in these grades. 

SED and Pearson applied three of the four methods to each selection and said in releasing this year’s material that “to make the final determination as to whether a text is at grade-level and thus appropriate to be included on a grade 3-8 assessment, all prospective passages undergo quantitative text complexity analysis using three text complexity measures…. Only passages that are determined appropriate by at least two of three quantitative measures of complexity and are determined appropriate by the qualitative measure of complexity are deemed appropriate for use on the exam.”
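The rule SED quotes is simple enough to write down as a check. The Python sketch below is only an illustration under stated assumptions: the function names and the sample passage are hypothetical, and the only grade band taken from the discussion above is the Lexile range of 740L – 1010L for 4th and 5th graders.

```python
def in_band(score, low, high):
    """True if a readability score falls inside a grade band (inclusive)."""
    return low <= score <= high


def passes_sed_rule(quantitative_flags, qualitative_ok):
    """Apply the stated rule: appropriate on at least two of the three
    quantitative measures AND appropriate on the qualitative review.

    quantitative_flags: dict of measure name -> True if in the grade band.
    """
    return sum(quantitative_flags.values()) >= 2 and bool(qualitative_ok)


# A made-up 4th-grade passage. The Lexile score of 1100 is invented; the
# other two flags are simply assumed for the sake of the example.
hypothetical_passage = {
    "Lexile": in_band(1100, 740, 1010),   # outside the 740L-1010L band
    "Flesch-Kincaid": True,               # assumed in band
    "Degrees of Reading Power": False,    # assumed out of band
}
hypothetical_result = passes_sed_rule(hypothetical_passage, qualitative_ok=True)
```

Here the invented passage is in band on only one of three quantitative measures, so it fails the rule even though the qualitative review approved it. Running each row of the released complexity metrics through a check like this is one way to count how many passages met SED's own criterion.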

In reviewing SED’s latest data on the 2016 exams, I counted 11 of the 55 operational passages as failing to meet the criterion that they had to be found to be appropriate by at least two of the methods.  I don’t know how SED will resolve this contradiction.


Two final observations: The material just released by SED makes no mention of the Common Core Learning Standards, unlike the Released Questions with Annotations of 2013-2015. Instead, SED has kept the boilerplate found in those releases and reverts to the New York State P-12 Learning Standards as the framework it follows.  Nor could I find any reference to “college and career readiness.” I guess both terms have been discarded due to the botched implementation of the Common Core.

Finally, I think we should press SED for the missing information outlined above and keep demonstrating how the department and commissioner continue to pose as being responsive, while taking business as usual actions.  Once we let up, they will fall back to disdaining that messy part of democracy—the will of the people.

- Fred Smith

1 comment:

Audrey said...

I'd like to ask Fred Smith to evaluate and comment on this question. It's only one item, but every year that we have had access to the tests there have been items that have had overly plausible distractors or question stems that ask for one thing and reward another. In this case p-values may work out, but nonetheless, every child who gets this question wrong for the right reason has been unfairly assessed as weaker by one question. I would love to get expert opinion on this question.