Although support for a voluntary student examination system has emerged from numerous quarters, one group—the New Standards Project (NSP)—has done much of the groundwork to make this proposal a reality. Last spring, for example, more than 7,000 4th graders in 17 states completed “performance tasks” in writing or math, which were then scored by teachers according to agreed-upon rubrics. Over the next three years, the Project plans to develop assessments in math, English, and science for grades 4, 8, and 10. Lauren Resnick, Director of the Project, and Warren Simmons, Director of Equity Initiatives, spoke recently about their efforts.
What's behind the New Standards Project; why is there a need for it?
Resnick: We believe that the American educational system has to begin to deliver on its rhetoric and see to it that all students are educated to high standards. The development of standards and assessments is a critical piece of reforming the entire educational system so that it is much more coherent and is driven by much higher standards. We are not simply a standards or testing undertaking; we want standards and assessments to help bring about better student outcomes—a different quality and higher level of student achievement.
Why do you think that creating more tests is going to have a powerful impact on classroom instruction?
Resnick: Because the evidence is just incontrovertible that whatever kind of test matters in the system has a heavy influence on classroom practice. In a way we should be celebrating that; don't we want teachers to work hard so that their students will achieve on the assessments that society says matter?
But we got ourselves into a Catch-22 in this country by using forms of assessment that weren't designed to be taught to. Teachers were told: “Raise the scores but don't prepare the kids for the test.” A great deal can be done by changing those tests to ones worth teaching to, as we hope ours are. For some teachers, a huge energy will be released. Of course, there will have to be encouragement, peer pressure, all kinds of support before teachers know what to do in this new environment. But there's every reason to believe it will work because we've seen the evidence. That testing drives instruction is usually pointed to as a negative. I believe it's a positive.
But only if the tests are quite different. What kind of assessments will your project use?
Simmons: We will be using portfolios that will contain a combination of on-demand and curriculum-embedded assessments; performance-based matrix exam tasks, projects, exhibitions; and work selected by districts, schools, teachers, and students. So exemplary work that a student has done in the classroom will be part of the evidence used in the NSP examination system. This will extend the number of indicators used to make judgments about student competence and increase their proximity to curriculum and instruction in the classroom.
How involved will classroom teachers be in all this?
Simmons: We're building an examination system that we want teachers to prepare students to do well on. To bring that about, teachers have to be intimately involved in the development of tasks, in the work associated with developing and refining content standards, scoring student responses, and so on. If we really want a closer connection between assessment, curriculum, and instruction, teachers and content specialists must be a major part of the enterprise.
That has not been the model used to develop traditional standardized tests. Those assessments are developed by testing experts and administered by teachers, and the results taken back and scored by machines. We're talking about building an assessment system that heavily engages teachers in task development, scoring, and using the results to improve curriculum and instruction.
Many experts agree that the performance-based assessments you talk about can enhance classroom instruction. But should they be used for accountability purposes as well? Should teachers or schools be judged by how well students do on these assessments?
Resnick: Yes, for a simple reason. You shouldn't hold people accountable for something they can't directly work on, as has frequently been the case in our education system. The accountability system has to use those assessments that you're willing to have kids keep practicing.
But surely traditional standardized tests are going to continue to play a role in accountability systems for some time. What's going to happen in the interim?
Resnick: One good possibility would be to begin a baseline assessment of the new kind and hold people accountable for it three or four years from now. And at the moment the baseline assessment begins, drop most or all of the old forms of testing. That makes the most sense, because to keep the old tests while you're trying to bring the new system aboard would make teachers wonder which direction to go. Some states look like they're willing to take this step.
An intermediate option would be to keep the previous forms of testing in place, but agree not to report the scores, or to report them only at a highly aggregated level, so that the pressure on the individual teacher is reduced. In that case, the old tests are still there in case the new system isn't ready in time.
I think it's going to have to be worked out state-by-state and district-by-district. The important thing is that the decision should be made at the lowest possible level of the bureaucracy. For example, it would be good if the federal government got out of mandating any particular kind of assessment for Chapter 1. The same thing can be said at the state level, where more decisions in the interim period might be left to local school districts. These are all options that will have to be sorted out.
Your project has begun testing different kinds of assessments over the past year. What are you finding?
Simmons: At this point it's premature to talk about the performance of individual students. At present we are exploring the best ways to develop high-quality tasks and scoring procedures that produce valid and reliable results.
And what's the big picture regarding the results of your efforts?
Resnick: Well, the big picture is we can do it. Logistically, it is possible to have teachers all over the country working with these assessments in roughly the same time frame. And this is essentially with no preparation, so we can be very optimistic that our basic way of doing assessments can proceed. Bigger logistical problems will arise as the Project becomes bigger, of course, but we really are talking about logistics, not fundamental feasibility.
The second major outcome concerns scoring, and here, again, the story is that we can do it. American teachers can judge with reasonable reliability. Now, this isn't really new. There's plenty of evidence that if you carefully set up rubrics and calibrate teachers to one another, you can get inter-judge agreement. So, in a way, what we've done is to replicate that. But we've done it under conditions that pushed the envelope, just as the assessment tasks themselves pushed the envelope.
Having said that, I need to point out that the reliability that we got in the pilot isn't good enough. We're aiming to assess individual children, and we do intend to apply a judgment about their work of “good enough” or “not yet good enough.” What we discovered in the pilot is that it's at this critical point in making discriminations among student work that the reliability is lowest. So we have to improve our methods to meet a different intended use of the assessment.
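To make "inter-judge agreement" concrete, here is a minimal sketch of two common ways to quantify it for a pair of scorers using a 1-to-6 rubric: the exact-agreement rate and Cohen's kappa, which corrects for agreement expected by chance. The interview does not say which statistic the Project used, so the choice of measures, the function names, and the sample scores below are all illustrative assumptions.

```python
from collections import Counter


def exact_agreement(scores_a, scores_b):
    """Fraction of papers on which two scorers gave the same rubric score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Agreement between two scorers, corrected for chance agreement."""
    n = len(scores_a)
    observed = exact_agreement(scores_a, scores_b)
    count_a = Counter(scores_a)
    count_b = Counter(scores_b)
    # Chance agreement: probability both scorers pick the same score
    # if each scored independently at their own observed rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both scorers used a single identical score throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores from two teachers on the same eight papers (scale 1 to 6).
teacher_1 = [3, 4, 2, 5, 3, 4, 1, 6]
teacher_2 = [3, 4, 3, 5, 2, 4, 1, 6]

print(f"exact agreement: {exact_agreement(teacher_1, teacher_2):.2f}")
print(f"Cohen's kappa:   {cohens_kappa(teacher_1, teacher_2):.2f}")
```

A statistic like this can look acceptable over the whole scale and still be weak right at the "good enough" boundary, which is the problem Resnick describes.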
Simmons: The pilot taught us some other important lessons. We surveyed students and teachers about the assessment tasks we piloted. Their comments confirm that these assessments do get teachers thinking about learning, and about curriculum and instruction.
Students also said some interesting things. These new assessments require students to perform in ways that they are not accustomed to: to think, to express their opinions and ideas, and to reflect on and produce responses over a considerable period of time. Many students who have taken traditional standardized tests commented that they did not like our tasks because they preferred knowing in advance that there was a right or wrong answer, or a limited time to choose a response.
I'd like to talk a bit about the idea of “national standards” that's attracted so much attention recently. It seems that many people define the term “standards” differently. How do you use the term?
Simmons: We are dealing with three different types of standards: content standards, performance standards, and school delivery standards.
By content standards, we mean narrative descriptions of desired outcomes in various subject areas—the types of things represented in the curriculum frameworks produced by states like California, Vermont, and Connecticut. Performance standards result from defining and providing concrete examples of the level and quality of performance students must exhibit to show mastery of a particular area. When is the behavior or performance good enough to indicate that a student can apply the skills and knowledge framed by a set of content standards? School delivery standards are indicators of whether a school has the resources— for example, a challenging curriculum and qualified teachers— necessary to enable students to meet the performance standards.
Let's start with the content standards. How is what you're advocating different from all the previous attempts to better define some essential skills and understanding for all students?
Simmons: Well, one difference is that we've really never had content standards that were applied to all students. I remember surveying high school teachers in a school district where I worked about what they expected all students to know when they graduated from high school. Their most frequent response was: “It depends.” They meant that it depended on the particular program or track a student was in. So there's been no common understanding of desired outcomes for all students. This national movement for content standards, however, clearly intends these standards to apply to all students.
Resnick: It's a mistake to think we've had a planned set of discussions around what students should learn. What we've had emerged from textbook companies' scope-and-sequence plans and from states and districts defining objectives. We did have an emergent set of ideas about what the content should be, so the content really hasn't been so different from one place to another. But we never had the discussion: unlike nearly every other country, we never had a passionate yet reasoned discussion about what was worth teaching and what students ought to know. That's changing as educators and others have begun to create national standards in various subject areas.
And is your project going to adopt the content standards created by these national groups?
Resnick: We'll use them as our operating starting point. But we'll also be informing the continuing process of revising and upgrading these national content standards. So, where they exist—only in mathematics presently—we'll be in a feedback loop. They don't yet exist in other subject areas, so we'll use the earliest versions that become available. In the case of English, the people in the New Standards Project working in this area are essentially the same people who are developing the English content standards for the country. So we'll be keeping in continual communication with the national process.
So you're advocating one set of content standards for all students. Is that true of performance standards as well? Is one standard of performance what you're striving for?
Resnick: This has to be answered carefully, because it's complex. The idea is that there will be one standard for all different “classes” of Americans: ethnic groups, social classes, socioeconomic groups, speakers of different languages, and so on. In that sense we are aiming for a single standard, and we aim to make it a very high standard.
In the pilot test, across all the different tasks—and allowing for the fact that we don't yet have scoring reliability quite where it needs to be—only about 25 to 30 percent of students met a criterion that our people would be willing to call passing.
In establishing that standard (taking that, for now, as an example of a standard), we're only getting about a quarter of our kids who can pass it. Not even the kids labeled as gifted and talented tended to perform at the very top of the scales that were established. The percentage getting a score of 6 on a scale of 1 to 6, for example, was only about 5 percent. So we've got a very big issue of raising the overall standard. What we expect to establish is passing standards of very high minimums, you might say, and then honor standards beyond that.
How do you ensure that these performance standards aren't just arbitrary cutoff scores?
Resnick: What we're setting is not arbitrary. What we've done is to bring together some of our leading teachers and other educators—and by the time we're finished, whole communities and the whole nation will participate—to decide in a substantive way what's good enough. People will make a judgment, but it will be a judgment about what kinds of performance we expect from students. Then, 25 percent of students may make it, or 50 percent, or 95 percent. There won't be any “Lake Wobegon” effect, because we're not asking for a standard of 50 percent above or 50 percent below.
Many educators seem concerned about the standards movement because they think it's unfair to hold students to a high standard if their opportunities and backgrounds are so varied. How do you respond to this?
Simmons: That argument ignores the fact that students are held to high standards by someone. If they are not held to high standards by schools, they're certainly going to be held to high standards by employers, by their communities, and so forth. So the idea that educators are doing students a disservice by holding them to high standards is a fallacy. Right now, many students are finding that the diploma they earned is not meaningful because it does not provide them access to opportunities in society.
We have to shift the discussion away from whether we should hold students to high standards and toward what resources we need to get everybody to meet these high standards. This is where the delivery standards come in. They focus attention on whether or not students have had the opportunity to meet high standards.
In the past, I think we mistakenly decided to wait until we got the resources to hold the students to high standards. The students are paying a price for that delay.
But if students fall far short in meeting these standards, couldn't that failure be used to justify denying schools more resources?
Simmons: I don't think so. Not holding young people to higher standards, in fact, weakens public support for additional resources. Right now, the public is convinced that the bad news they hear about student achievement applies to someone else's child, not their own. So they don't understand our arguments for more resources.
Once we articulate our standards and show clear evidence of what we need in the way of resources and opportunities to get the majority of kids there, then I think the public is going to be much more sympathetic to our arguments about resources. But right now, all we have to offer the public are dropout rates and promotion rates as reasons for additional millions of dollars—and those haven't been persuasive arguments in the past.
There's been a considerable amount of discussion lately around the idea of a national examination system, as opposed to a national test. The actual workings of such a system seem very complex. What would such a system mean in the context of your project?
Resnick: Let's assume we've got agreed-upon content standards that are embodied in the national content standards. And we also have agreed-upon performance standards embodied in collections of student work that have been judged by a public process to be “good enough,” “not good enough,” or “honors.” So we have some benchmarks. We'll know they are all up to the standard because we went through a public judgment process to make that decision. Let's assume that's sitting there.
Now, suppose I'm in the state of Colorado and I have an assessment. I want to know whether what I call good enough in writing or in mathematics is up to the standard of this national process. Will my benchmark figures fit in the national set?
There are two or three ways you can go about figuring this out, and we'll be working on all of them until we determine which works best. One step that seems absolutely essential is for a body of people who have the public's confidence and who have Colorado's confidence to look over Colorado's assessment. They might have to visit Colorado and see what it looks like in the actual schools. In any case, they look at whether the content of the Colorado assessment is consonant with the national content standards.
If it is, then you can go to a couple of next steps. One is that Colorado might have agreed to include in its own assessment some tasks that were drawn from the national benchmarking pool. And you look at the relationship between how Colorado kids perform on the benchmark tasks and how they perform on the rest of Colorado's assessment. That's one method.
Sweden is using another method. At the end of 9th grade, all kids take a national exam in math, Swedish, and English. The teachers grade the exams, and the teachers also have given students their grades based on the last two or three years. A sample of the exams goes to Stockholm to be regraded. Then the distribution of the school's grades overall is compared with the distribution of that school's grades on the sample of papers that were sent to Stockholm. If the distributions match well enough, then the school's grades hold and there is no further inspection. So those are the two basic models that we'll be exploring.
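As a rough illustration of the second model, the sketch below compares the distribution of a school's own grades with the distribution of grades on a centrally regraded sample. Resnick says only that the distributions must "match well enough", so the specific measure used here (total variation distance), the 0.10 tolerance, the function names, and the sample grades are hypothetical choices for illustration, not a description of the Swedish procedure.

```python
from collections import Counter


def grade_distribution(grades):
    """Map each grade to the share of papers that received it."""
    counts = Counter(grades)
    total = len(grades)
    return {grade: count / total for grade, count in counts.items()}


def total_variation_distance(dist_a, dist_b):
    """Half the summed absolute difference between two distributions (0 to 1)."""
    grades = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(g, 0.0) - dist_b.get(g, 0.0)) for g in grades)


def grades_hold(school_grades, regraded_sample, tolerance=0.10):
    """True if the two grade distributions are within the assumed tolerance."""
    distance = total_variation_distance(
        grade_distribution(school_grades),
        grade_distribution(regraded_sample),
    )
    return distance <= tolerance


# Hypothetical data: grades the school awarded, and the central regrade of a sample.
school_grades = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5]
regraded_same = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5]
regraded_lower = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

print(grades_hold(school_grades, regraded_same))   # distributions identical
print(grades_hold(school_grades, regraded_lower))  # school graded one step higher
```

Under this sketch, a school whose sample is regraded consistently lower would fail the check and, in Resnick's terms, would be a candidate for further inspection.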
In the pilot test administration, you tested several tasks. Will the New Standards Project actually be developing complete assessments?
Resnick: Yes, this spring we pilot full-scale matrix assessments in math and English language arts. By 1996, a portfolio system capable of yielding individual student scores is expected to be ready.
What do the partners in the project actually get for their involvement in the project?
Resnick: We don't turnkey a test: they don't give us X dollars so that we'll deliver a completed test a year and a half later. We do, in effect, turnkey a process, though. By working with us, the partners will be developing their capacity to create and score tasks, to train other people to develop and score tasks, to develop scoring procedures, to develop instructional programs that go with these new kinds of assessments, and so on. All of this is linked to their own curriculum standards, and they will be participating in the processes of public engagement in standards setting. So, in fact, what they get and what they give are intimately connected.