Scores on value-added measures have not helped these New York teachers teach more effectively.
Teacher evaluation systems are evolving quickly in response to changing mandates. Many systems have multiple components, combining observational measures of instructional practice with measures of student learning. These components are often uneasy companions in the pursuit of two separate goals: improving teaching practice, and judging teachers as either competent (and thereby worthy of tenure, additional responsibilities, or supplemental compensation) or ineffective (and thereby slated for corrective action or termination).
It's important to ask how these new evaluation models are serving teachers as they strive to continually improve their practice. Do teachers change their teaching practices in response to what they learn through a teacher evaluation system? In 2014, we explored this question by interviewing 13 New York City teachers and the principal of a school we'll call the Frank Lloyd Wright School about their reactions to one specific component of their teacher evaluations: value-added measures of their performance (VAM).
A Little Context
Before we examine these teachers' reactions, let's consider the context. New York City introduced a new state-mandated teacher evaluation system in the 2013–14 school year. In this model, 20 percent of the annual evaluation of English language arts and/or mathematics teachers in grades four through eight is based on a VAM score derived from their students' performance on Common Core-aligned tests. The state-provided statistical model generates a mean growth percentile that indicates where each teacher's students rank on state test performance as compared with students across the state who have similar academic histories and characteristics.
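For readers curious about what a "mean growth percentile" involves, the sketch below is a minimal, hypothetical illustration in Python, not the state's actual statistical model (which conditions on richer academic histories and student characteristics). It simply pools students by prior-year score, computes each student's percentile rank among similar-scoring peers, and averages those percentiles for each teacher; all names and numbers are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior_year_score, current_year_score).
# The real model uses multiple years of scores and student characteristics;
# this sketch groups on a single prior score for illustration only.
students = [
    ("Teacher A", 300, 310), ("Teacher A", 300, 295), ("Teacher A", 320, 330),
    ("Teacher B", 300, 305), ("Teacher B", 320, 318), ("Teacher B", 320, 340),
    ("Teacher C", 300, 290), ("Teacher C", 300, 312), ("Teacher C", 320, 322),
]

# 1. Pool students by prior-year score ("similar academic histories").
peer_groups = defaultdict(list)
for _, prior, current in students:
    peer_groups[prior].append(current)

def growth_percentile(prior, current):
    """Percent of similar-prior-score peers this student outscored."""
    peers = peer_groups[prior]
    below = sum(1 for score in peers if score < current)
    return 100.0 * below / len(peers)

# 2. Average each teacher's students' growth percentiles.
teacher_percentiles = defaultdict(list)
for teacher, prior, current in students:
    teacher_percentiles[teacher].append(growth_percentile(prior, current))

for teacher, percentiles in sorted(teacher_percentiles.items()):
    print(f"{teacher}: mean growth percentile = {mean(percentiles):.1f}")
```

Even in this toy version, the key feature of the measure is visible: a teacher's score depends on how other students across the comparison pool performed, not on anything directly observable in the teacher's own classroom.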
The New York State Education Department first calculated VAM scores for teachers on the basis of these mean growth percentile scores for the 2011–12 school year. Although many teachers across the state, including those in New York City, received these scores, the scores didn't have consequences that year because the statewide implementation of the new evaluation process—termed Annual Professional Performance Review—didn't begin until the 2012–13 school year. Implementation in New York City was delayed an additional year (until 2013–14), because it took that long for the school district and teachers union to agree on an evaluation system.
Under the Annual Professional Performance Review system, each teacher receives a summary evaluation based on state-approved and local measures of student performance (including the teacher's VAM score), classroom observations, and other measures. The score categories for this summary evaluation are Highly Effective, Effective, Developing, and Ineffective. Teachers who receive a summary rating of Ineffective for two consecutive years are subject to an expedited dismissal process.
Although the 2012–13 school year was the first time VAMs had been used in formal evaluations in New York State, they weren't new to New York City teachers. Beginning in the 2007–08 academic year, the city's Department of Education produced Teacher Data Reports, which ranked teachers using a value-added model that generated a percentile ranking (roughly interpretable as the percentage of teachers whose contributions to their students' performance on the state tests were below that of the given teacher).
After a protracted debate in the courts, the New York City Department of Education released the 2007–2010 Teacher Data Reports for approximately 18,000 teachers to newspapers and other news outlets. Most of these media outlets published lists of individual teachers and their scores. Some teachers were branded the worst teachers in the district and pilloried in the media (Hancock, 2012). These reports weren't used for high-stakes individual evaluations, but their publication left teachers feeling vulnerable.
Reactions to VAM Scores at a High-Achieving School
The Frank Lloyd Wright School, the site of our study, is a large elementary school that mainly serves white, affluent students. Most students perform at or above grade level, and one teacher noted that "the kids come to school engaged and excited to learn." Wright parents are deeply involved in the school and their children's schoolwork, and the school's PTA ensures that teachers have all the resources they need.
Teachers uniformly reported that Wright "is a family" and that they were happy teaching at the school. Many contrasted their experience at Wright with less positive experiences teaching in other schools. Teachers viewed their colleagues as highly collaborative and supportive of one another. They noted that they planned lessons with colleagues and shared materials and lesson plans in a Dropbox so everyone could access them. These teachers believed that their principal set the tone for the school and their relationships with one another. Teachers reported that they trusted the principal and valued her instructional leadership skills and her educational vision. They felt that she shared decision-making authority with teachers and that it was safe to "speak up."
A Lack of Legitimacy
These teachers placed little importance on their VAM scores. For one thing, they believed that the tests on which VAMs are based lacked legitimacy as measures of students' (and teachers') performance. The educators at Frank Lloyd Wright had a strong shared vision of good educational practice, one that valued students being actively engaged in learning and having authentic experiences with content. The school did little test preparation, and teachers reported that they would be frowned upon if they spent class time this way.
Teachers overwhelmingly felt that the New York state tests were not a good measure of the kind of learning they were trying to advance. In some cases, they noted that these tests directly opposed what they were trying to teach students. Rebecca provided an example of such a conflict with the English language arts tests, which were aligned with the Common Core standards. Test items frequently required students to read a short story and respond to a series of questions about the story. A question might ask a student to describe why the setting is important to the story and to provide two details from the story to support that response. Rebecca felt that superficial tasks like this overshadowed the quality of students' writing. She noted, "I feel like everything we've taught them about good reading and writing, when they take the test, we're telling them not to do that. It's very frustrating."
Because teachers at the Wright School didn't consider the tests from which VAMs were calculated to be good tests, they largely didn't assign any meaning or importance to value-added scores, and VAM didn't become a topic of conversation among teachers. Christina told us that this choice reflected the school's core values:
There's a lot of conversation [at this school] about how testing is not a way to assess whether you're a good teacher. And it's also not a true assessment of how much the kids learn. … I guess people don't talk about [VAM] because they don't want to give it credit [as] something they should talk about.
These teachers weren't worried that a low VAM score would cause them to receive a negative or unfair evaluation, partly because they trusted their principal. Many noted that they would feel significantly more anxious about the system if they were working with administrators who were less trustworthy.
No teacher we interviewed entirely understood the basic principles of how VAM scores were estimated, and most admitted as much. Heather complained that the system used "the most complicated formula of all time," and when we asked Elizabeth to explain how VAMs work, she replied, "I don't have a clue." Teachers such as Heather and Elizabeth weren't troubled by this and expressed no interest in understanding VAM better. To them, the measure wasn't legitimate in the first place, so how it was calculated was irrelevant.
Teachers made numerous references to wild swings in teacher and school VAM scores from year to year, such as a teacher going from the bottom of the distribution in one year to the top in the next despite doing nothing different in the classroom. In general, teachers saw VAM ratings as unstable or inaccurate.
"It Doesn't Drive My Teaching at All"
No teacher we interviewed believed that their VAM score would help them improve their practice. Because teachers felt they didn't have control over their VAM scores, they didn't know what they could do differently to improve them. According to Jessica, these data are "like the weather. … Even after seeing [my students'] scores, I have no idea how this is going to come out." Some teachers reported that they had chosen not to look at their VAM score from the previous year at all.
Stephanie pointed to the absence of information about specific instructional weaknesses:
It doesn't drive my teaching at all. It makes no difference in how I teach. Teachers never get to even see the test. If I got to see it and see where my students had trouble … that would be helpful.
Similarly, Melissa said:
I don't think it can give me any feedback, unless it itemized every single question and gave me the standard that it matched to and gave me a percentage of my kids who missed, say, 5.03A on ELA. I wouldn't learn anything from it unless I saw that [for example] none of my kids can infer, none of my kids can find an idea.
From these teachers' perspectives, the test data might have had value if they had enabled teachers to disaggregate student performance across skills and make inferences about specific areas needing improvement. The VAM data didn't allow for this possibility.
When VAM Has Serious Consequences
Besides asking whether scores based on value-added measures can help teachers improve, it's important to consider whether such scores might have harmful consequences for teachers' growth and professional lives. Clearly, the teachers at the Wright School—who were in the fortunate position of teaching in a highly regarded school with high-performing students and a supportive principal—didn't take VAM seriously. But most teachers aren't teaching in such schools. And there have been many instances of teachers being shaken up and facing serious consequences because of a low VAM rating.
For example, Carolyn Abbott, a mathematics teacher at a New York City gifted and talented school, was rated the worst 8th grade mathematics teacher in the city on the Teacher Data Reports, the city's precursor to the statewide VAM calculations. She described herself as "angry, upset, offended" and "humiliated" by the publication of her ratings in the local media. The experience influenced her decision to leave teaching and pursue a doctorate in mathematics (Pallas, 2012).
Similarly, Sheri Lederman, a 4th grade teacher in Great Neck, New York, was shocked and angry when she received a 2013–14 VAM rating of 1 out of 20, which classified her as Ineffective. The year before, she had been rated Effective on the VAM, and she was also rated Effective the next year, in 2014–15. But the official state designation of Ineffective would follow her forever. Sheri thought seriously about quitting teaching. Instead, she sued the state, successfully arguing that she had standing to challenge the rationality of New York's VAM system in court.
Great Neck, Long Island, is a wealthy community with high-scoring students, and both Carolyn Abbott and Sheri Lederman had support from the administrators, parents, and students with whom they worked. But Elizabeth Morris (a pseudonym) taught in an upstate school district that couldn't point to terrific student outcomes to offset the stigma that might accompany a low score. Although Elizabeth had been awarded tenure, the VAM score of 5 out of 20 she received in 2011–12—and the accompanying rating of Developing—were jarring and at odds with her conception of herself as a skilled teacher. Elizabeth, an outspoken and independent teacher, felt vulnerable. She wrote to one of us,
Numbers are so very important in this culture. It has taken everything I've got not to brand my own self as a failing teacher in the face of my VAM, and I haven't been successful yet. … I fear that an administrator coming into my classroom who is aware of my VAM cannot objectively evaluate my performance. Instead, having decided subconsciously or consciously a priori … that I am a "Developing" teacher, they will be seeking to determine the weaknesses that caused my score, whether those weaknesses actually exist or not.
The following year, Elizabeth was rated Effective both on the VAM measure and overall. Reflecting on her experience, she said,
I don't mind saying that last year was the worst year of my professional life. I was the most stressed, the most disorganized, the most disheartened, and wandered furthest from the reasons I entered the profession in the first place. It is inconceivable to me that this was the year I "improved" (Pallas & Morris, 2014).
After this experience, Elizabeth—who'd been a classroom teacher for more than a decade—left classroom teaching.
Potentially Counterproductive
These various teacher reactions to their value-added scores indicate that these measures are at best unhelpful and at worst dangerously demotivating. Although we found that VAM scores didn't affect teachers' sense of self-efficacy at the Wright School, our conversations with teachers at other schools have made clear that low VAM ratings can be demoralizing, especially when teachers already feel vulnerable. The volatility of VAM ratings means that a low rating one year may well rebound the next, but that is only partial consolation; a year is a long time to worry about being labeled less than fully competent. Even in the Wright School, teachers noted that a low VAM score might lead them to switch schools, change grade levels or subjects, or even consider whether they wanted to stay in teaching.
The primary barrier to teachers using value-added data to improve their practice is their belief that the data contain no information worth taking seriously or capable of supporting a change in teaching practice. It wasn't clear to teachers what the next step should be after receiving a low score, which would likely be true even if they understood how these measures are constructed. As one teacher explained, "there's nothing to look at" in a VAM to improve teaching. In the absence of data about specific skills on which their students lagged, teachers were at a loss. Further, when VAM scores fluctuated wildly from year to year or teachers were held accountable for students they never taught, value-added measures seemed entirely outside teachers' control.
It's worth asking what kind of feedback would be helpful to teachers. In contrast to their view of VAM scores, teachers reported to us that they found classroom observations helpful in providing actionable feedback on their teaching in real time—so they didn't have to wait until the end of the year to make adjustments. But a great deal hinges on the extent to which teachers trust and feel supported by their principal. As we've noted, teachers at Wright praised and trusted their principal. That trust enabled them to accept the feedback accompanying her classroom observations as constructive, rather than punitive. So it's important that schools work to create this climate of mutual trust and support.
What's clear is that the link between VAM data and actions a teacher might take to improve is severed in most teachers' minds. This new technology has failed to achieve its promise—and may even be causing harm.
Authors' note: The names of teachers at the Frank Lloyd Wright School are pseudonyms.
References
Hancock, L. (2012, May). When big data is bad data: The press and standardized testing numbers: A cautionary tale. Columbia Journalism Review.
Pallas, A. M., & Morris, E. (2014). Pen pals. Impact on Instructional Improvement, 39(1), 11–19.
End Notes
1 There is a temporary moratorium on the use of this state statistical model in the teacher evaluation system, but the scores are still calculated and disseminated to districts, principals, and teachers.