Cohen: The trouble with test scores

Credit: Illustration by Christopher Serra
Fred Cohen, retired deputy superintendent from the Bellmore-Merrick CHSD, is a data analysis consultant with Nassau BOCES and is training county administrators and supervisors on new requirements for evaluating teachers.
With these five words -- "No evaluation, no money. Period" -- Gov. Andrew M. Cuomo has raised the stakes in the teacher evaluation controversy. Unless districts and teachers agree to base 40 percent of a teacher's evaluation on student test scores, promised increases in state aid will be denied.
But are test scores a fair measure of a teacher's performance? Proponents say that better instructors get better results and, even if this measuring tool is imperfect, the state Education Department will still weigh supervisor observations as 60 percent of a teacher's rating.
Proponents also claim that measuring teacher performance by students' assessment results is objective -- fair and accurate. Further, though state assessments may not be perfect measures of student performance, they do include both multiple-choice and extended-response questions and are based on state curriculum standards. And most important, teacher evaluation scores will be based not on student achievement, but on the growth of each instructor's students compared with the growth of similar students statewide who started with the same initial scores.
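The growth comparison described above can be sketched in a few lines of code. This is a simplified illustration, not the state's actual formula: the function, the peer scores and the two students are all invented for the example. The idea is only that a teacher is credited for how much each student grew relative to peers statewide who began with the same score.

```python
# Hypothetical sketch of a growth measure: a teacher is credited not for
# students' absolute scores, but for how much each student grew relative
# to peers statewide with the same starting score. All numbers invented.

def growth_percentile(student_growth, peer_growths):
    """Percent of comparable peers whose growth this student exceeded."""
    beaten = sum(1 for g in peer_growths if g < student_growth)
    return 100.0 * beaten / len(peer_growths)

# Statewide growth, in scale-score points, of peers who all started
# from the same initial score.
peers = [0, 5, 5, 10, 10, 10, 15, 20, 25, 30]

# Two students with identical starting scores but different growth.
print(growth_percentile(12, peers))  # grew 12 points -> 60.0
print(growth_percentile(4, peers))   # grew 4 points -> 10.0
```

Under a scheme like this, two students with the same final score can earn their teachers very different credit, depending on where each started.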
This certainly sounds like a level playing field. So why have approximately 700 New York State principals -- 500 from Long Island -- signed a letter protesting the use of this growth measure for evaluating their teachers? Why have leading groups of educators and researchers on the national scene claimed that teacher growth or "value added" measures are flawed and unstable -- varying so widely from test to test or year to year that they cannot be accurate?
These researchers cite numerous factors or variables beyond teachers' control that may distort test results. Factors may be as obvious as student demographics -- family income or parents' educational level -- or as seemingly innocuous as the time of day of the instructional period. Imagine a teacher, perhaps in an affluent school district, who is so weak that parents hire tutors to provide the instruction missing from the classroom. Who will be credited with those students' growth on the state tests? Will this teacher be fairly evaluated?
Similar examples of random error -- what researchers call statistical "noise" -- abound, but state Education Department experts claim that complex statistical formulas can account for these variables. As someone who has spent a decade analyzing standardized test results for school districts looking to improve instructional practices, however, I can say with some certainty that there are indeed unaccounted-for irregularities.
Consider these two examples, which I have seen with some frequency.
The first is inappropriate assistance during the proctoring of an exam. Under current regulations, teachers -- especially in elementary school -- may be alone in their rooms with students for the entire test. Any assistance, whether overt or subtle, not only contaminates results for the offending teacher, but it also affects the growth score for the instructor unlucky enough to teach those students the following year. Improvement will be highly unlikely because the previous year's test scores were achieved using the equivalent of instructional steroids. Depending on the amount of inappropriate assistance, test scores may drop precipitously the following year. And as test scores now become part of a teacher's evaluation, will these higher stakes provoke greater temptation to improve test scores?
A second, more innocent practice also creates an uneven playing field. Most districts do their best to grade essay-type questions according to the Education Department's established guidelines. Yet, at times, teachers -- with the best of intentions -- grade students' responses with greater rigor than required. The result can be a district whose students outperform their peers on every multiple-choice question, yet appear to underperform on every extended-response question. Although it's theoretically possible that teachers in the district taught only those skills related to the multiple-choice questions, it's far more likely that an overly strict interpretation of the rubric -- the state's recommended grading guidelines -- caused the imbalance in test results.
In the past, grading anomalies like this have affected student scores but not teacher ratings. Now they will have unintended effects on teacher ratings, too. Imagine a district that graded the 2011 seventh grade English Language Arts extended-response questions too severely. If the 2012 eighth grade ELA assessment is now graded with the same rigor as everyone else's, undeserved growth will be credited to the eighth grade instructors. And, of course, the reverse would be true if the 2011 teachers had been too generous. The teacher evaluation score will actually be measuring a change in grading practices, not measuring effectiveness in the classroom. An entire grade level of teachers in this district will receive inappropriate evaluation scores.
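The arithmetic behind this artifact is easy to see in a toy calculation. Everything below is invented for illustration -- the point sizes, the scores, the districts -- but it shows how a one-time shift in grading rigor flows straight into the growth number, crediting or penalizing the next year's teachers.

```python
# Hypothetical sketch of the grading-rigor artifact described above.
# Suppose a district's graders in 2011 applied the essay rubric 3 points
# more severely than the state norm, while 2012 grading matched the norm.
# All numbers are invented for illustration.

true_ability_growth = 10   # points of genuine learning in both districts
rigor_penalty_2011 = 3     # points lost to overly strict 2011 grading

# District A: strict 2011 grading, normal 2012 grading.
score_2011_a = 650 - rigor_penalty_2011
score_2012_a = 650 + true_ability_growth
growth_a = score_2012_a - score_2011_a        # 13 points: inflated

# District B: normal grading in both years.
growth_b = (650 + true_ability_growth) - 650  # 10 points: accurate

# The gap between the two "growth" figures is pure grading artifact.
print(growth_a - growth_b)  # 3 points credited to teaching that never happened
```

The reverse case is symmetric: lenient 2011 grading would subtract the same amount from the 2012 teachers' apparent growth, regardless of how well they actually taught.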
This is not random statistical error; rather, it is built-in systemic error -- an unaccounted-for variable -- that calls for correction.
As a longtime supporter of the concept of using student growth as one part of evaluating teachers, I trust that the state Education Department will eventually tighten procedures to control the inequalities that add to the instability of teacher growth measures.
One fix -- adding a second proctor to the room during exams -- seems easy, but it is not feasible under current staffing and budget constraints. The second fix -- regional scoring of tests by teams of teachers from multiple districts -- was once a common practice and will likely be considered again in the future.
These problems are solvable, but we're not there yet. Until these systemic issues are resolved, translating teacher performance into a numerical rating will only produce further resistance from thoughtful people. At present, results should be viewed as purely advisory. The governor's and the public's urgency to get teacher evaluation in place immediately should not outweigh the need to get it right.