Student test scores matter when evaluating teachers

This is part of a series from education blogger Laura Waters of NJ Left Behind.

One of the most contentious issues surrounding education reform is whether one can measure teacher effectiveness through student test scores. While other professions have long used objective data to measure performance – doctors get rated on patient outcomes, lawyers get rated on successful resolutions – rating teachers is considered far more complex and politically fraught. Historically, schools have relied on subjective observations, usually performed by principals or supervisors.

Teacher unions have fought to preserve this system on the grounds that using test data as a measure of effectiveness penalizes those who work with kids with disabilities, or those from impoverished homes, or those just learning English.

Test scores reflect teacher effectiveness

Now, a new study offers evidence that objective data can fairly evaluate teachers in spite of those differences. The Bill and Melinda Gates Foundation, which spends about $450 million a year on education programs, just completed a three-part, three-year study called “Measures of Effective Teaching (MET).” The third part, “Ensuring Fair and Reliable Measures of Effective Teaching,” studied 800 teachers to determine what combination of data and classroom evaluations would most reliably predict teacher proficiency.

After analyzing student test results in a variety of districts from school year 2009-2010, researchers randomly assigned students to teachers (within their home buildings) the following year. Test scores were adjusted for previous outcomes, degree of poverty, disability, and other factors that would affect student growth. Researchers often refer to this sort of data-driven evaluation as VAM, or value-added measures.

The results of the study show that student test scores are reliable measures of teacher effectiveness. The best combination, says the report, is to rely on test data for between 33 percent and 50 percent of a teacher’s total evaluation, although “composites that put 65+ percent of the weight on the student achievement gains on those tests will generally show the greatest accuracy.”

Classroom observations not enough

The use of classroom observations as a sole metric doesn’t work: “It is clear from these findings and the MET project’s earlier study of classroom observation instruments that classroom observations are not discerning large absolute differences in practice.”

No big deal, right? Student longitudinal growth – the increase in academic achievement from one year to the next – corresponds to a teacher’s proficiency in the classroom. Test data, coupled with traditional observations and student surveys, fairly evaluates teachers.

Gates study renews debate

Oh, very big deal. The reactions of pundits, politicians, and educators (from both the reform and anti-reform aisles) can roughly be divided into two camps: those who say the Gates study went too far and those who say the Gates study didn’t go far enough.

Here, as an example of the former, is educational historian Diane Ravitch: “VAM is junk science. The low ratings tend to go to teachers of ELL [English Language Learners], special education, and troubled kids. The scores, it turns out, measure WHO you teach, not teacher quality. VAM isn’t working anywhere, yet our nation will squander hundreds of millions, maybe billions, trying to make it work…Junk science is junk science.”

And, from the other side, here’s Jay P. Greene, education reformer, who suggests that the final report obfuscated one of the most important findings – that traditional classroom observations offer little in the way of meaningful assessment – out of fear of offending teacher unions and setting back a national movement towards more accurate teacher evaluations. “Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system. You have to dig to find this, but it’s right there,” Greene writes.

Both Ravitch and Greene are tilting at windmills. Ravitch dons armor to fight the inevitability of accountability. Greene mounts his steed to rage against political expediency. But the evidence is there: used cautiously, student test data can be reliably correlated with teacher proficiency. Is it perfect? No. Is it better than what we’ve been doing? Yes.

New Jersey’s approach

New Jersey is implementing VAM through the mandate of teacher tenure and evaluation reform legislation passed last year. Our system – still evolving as districts frantically try to meet the DOE’s September deadline – uses something at the low end of the Gates study’s recommended 33 to 50 percent. That’s enough to offend both the Diane Ravitches and the Jay Greenes of the world. We’re both too radical and too timid, depending upon whom you ask.

Laura Waters is president of the Lawrence Township School Board in Mercer County. She also writes about New Jersey’s public education on her blog NJ Left Behind. Follow her on Twitter @NJLeftbehind.
