With standardized writing tests here to stay, educators would do well to learn how their students' writing will be scored and how they can apply assessment techniques in their own classrooms.
In the United States, policymakers, advisory groups, and educators increasingly view writing as one of the best ways to foster critical thinking and learning across the curriculum. The nonprofit organization Achieve worked with five states to define essential English skills for high school graduates and concluded that

Strong writing skills have become an increasingly important commodity in the 21st century. . . . The discipline and skill required to create, reshape, and polish pieces of writing “on demand” prepares students for the real world, where they inevitably must be able to write quickly and clearly, whether in the workplace or in college classrooms. (2004, p. 26)
The emphasis on writing in the curriculum has been accompanied by the rapid growth of writing assessment. My middle daughter, who recently finished 8th grade in a Pennsylvania public school, has known what a scoring rubric is since 4th grade, two years before she had to take her first official state-mandated writing assessment. My youngest daughter, still blissfully naïve in 2nd grade, has yet to hear the term at school but will surely encounter it next year when her teacher begins talking about the Pennsylvania System of School Assessment.
My daughters are not alone. Increasingly, students are being asked to write for tests that range from NCLB-mandated subject assessments in elementary school to the new College Board SAT, which will feature a writing section beginning in March 2005. Educators on the whole have encouraged this development. As one study argues,

Since educators can use writing to stimulate students' higher-order thinking skills—such as the ability to make logical connections, to compare and contrast solutions to problems, and to adequately support arguments and conclusions—authentic assessment seems to offer excellent criteria for teaching and evaluating writing. (Chapman, 1990)
But what constitutes authentic assessment of student writing? The answer varies across state lines. Although most state assessments are linked directly to state standards and judge writing according to such elements as purpose, organization, style, and conventions (including grammar, usage, and mechanics), states diverge in what skills they test and how they test them.
Some state assessments more fully simulate the writing process than others, incorporating the brainstorming, first draft, editing, final draft, and proofreading stages. Many states follow the National Assessment of Educational Progress model that assesses different writing genres—typically narrative, expository, and persuasive. Some states assess writing as a specific part of their exit exams (California); others include a high school-level writing assessment as part of the overall graduation requirement (Florida). Some states use such indirect measures as multiple-choice items on grammar, usage, and punctuation (Maryland), whereas others rely more heavily on direct, or authentic, writing (New Jersey and Massachusetts). In Wisconsin, two rubrics are used to score students' papers: A composing rubric measures students' ability to write organized prose directed clearly and effectively to an audience, and a conventions rubric measures students' ability to apply the conventions of standard written English.
Despite the broad divergence in the writing skills they measure, U.S. states have one thing in common: Almost every K-12 education system has, or will soon have, some kind of direct writing assessment and an accompanying scoring guide. With this significant increase in standardized writing assessment, educators would do well to learn just how the hundreds of thousands of essays written each year are scored. After briefly discussing the history of U.S. writing assessment, I will describe the most common scoring method used today, how technology is changing assessment, and how educators can hone their students' writing skills by applying assessment practices in the classroom.
Overview of Writing Assessment
Writing assessment in the first part of the 20th century favored lengthy essay examinations, but the need for efficient mass testing in the 1940s led to the development of multiple-choice tests. Multiple-choice writing questions ask test takers to perform such tasks as identifying errors in usage and grammar in a short writing extract. These tests have logistical advantages in terms of efficiency and cost of scoring, and their results seem to correlate broadly with writing ability. They lack validity, however, because they fail to assess test takers' ability to develop an individual piece of original writing. Beginning in the 1970s and continuing to the present day, large testing programs have increasingly opted to test writing by combining multiple-choice questions with at least one essay-writing section.
The most common method currently used to score the essay sections of such tests is often termed modified holistic scoring. Holistic scoring of timed, impromptu responses to general writing topics is a relatively recent phenomenon in assessment, coming into prominence in the 1970s with the advent of large-scale, high-stakes, timed writing assessments. Although many members of the greater writing community have been reluctant to embrace it, holistic scoring continues to dominate major writing assessments today.
Holistic Scoring
The term holistic in this sense reflects the idea of evaluating the response as a whole: scorers do not quantify strengths and weaknesses but rather give a score based on their first impression of the overall quality of the writing. Modified holistic scoring combines norm-referenced scoring (judging how well each individual response is written compared to others) with criterion-referenced scoring (evaluating responses according to specific, defined criteria) and has developed into a relatively reliable and valid assessment (Cooper & Odell, 1977). The challenge has been scoring the increasingly large number of responses accurately and efficiently.
For most current major writing assessments, the process begins with developing writing prompts that are equally accessible to all prospective test takers (Boomer, 1985; Murphy & Ruth, 1999; Ruth & Murphy, 1988); no student should be penalized or rewarded because a prompt requires special knowledge or outside experience. Test developers generally pilot-test a variety of writing tasks to determine which ones most appropriately measure the construct being assessed. They then pre-test the selected prompts, ensuring that each elicits a variety of responses at different levels. Finally, they use the pilot and pre-test results to finalize a cross-prompt, generic scoring guide.
Scoring guides present hierarchically arranged descriptors at each point of the scoring scale that define distinct levels of criteria by which readers can judge the quality of a given response. For example, one paper might be judged on a four-point scale (Advanced, Proficient, Basic, and Below Basic) according to five discrete criteria: insightful reasoning, clear and useful examples, sophisticated organization, good sentence variety, and general facility with the conventions of standard English.
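To make that structure concrete, the four-point example above could be represented in a few lines of code. The sketch below, in Python, is purely illustrative: the labels and criteria are taken from the example in the previous paragraph, and nothing about it reflects any particular state's or test maker's actual scoring guide.

```python
# Hypothetical representation of the four-point scoring guide described above.
# Only the structure matters: each score point carries a label, and readers
# weigh the same set of criteria at every level of the scale.

SCORE_POINTS = {
    4: "Advanced",
    3: "Proficient",
    2: "Basic",
    1: "Below Basic",
}

CRITERIA = [
    "insightful reasoning",
    "clear and useful examples",
    "sophisticated organization",
    "good sentence variety",
    "general facility with the conventions of standard English",
]

def describe_scale() -> None:
    """Print each score point alongside the criteria a reader weighs."""
    for point in sorted(SCORE_POINTS, reverse=True):
        print(f"{point} ({SCORE_POINTS[point]}): {', '.join(CRITERIA)}")

describe_scale()
```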
The next stage is defining and hiring the population of readers appropriate for a given assessment. This group often includes a certain percentage of teachers at the grade level of the test takers. Readers are trained with a scoring guide that clearly defines the score points, and they review exemplar papers that provide specific examples of responses at each score point. These papers “calibrate” readers to the scoring guide, ensuring that they are all bringing the same perspective and standards to the scoring process.
After the test developers and the readers develop a consensus on how the responses will be scored, the operational reading can begin. Typically, two different readers independently score each response. After the first reader assigns a score to an essay, the second reader (who does not know the first reader's score) reads and scores the response. Scores that are identical or adjacent are typically averaged to produce the final score. When the scores are more than one point apart, the essay usually goes to a third reader, whose score is averaged with the closer of the first two (or with both when it is exactly between and adjacent to both of the first two scores).
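A minimal sketch of this resolution arithmetic, written in Python, may make the procedure easier to follow. The function name and the assumption of an integer scale are mine; actual scoring programs handle the details differently, so treat this only as an illustration of the general rule described above.

```python
# Illustrative sketch of how two (or three) reader scores are resolved.
# Assumes an integer scale such as 1-4; the rules follow the general
# procedure described above, not any specific testing program's manual.

from typing import Optional

def resolve_score(first: int, second: int, third: Optional[int] = None) -> float:
    """Combine independent reader scores into a single final score."""
    if abs(first - second) <= 1:
        # Identical or adjacent scores: average the two readings.
        return (first + second) / 2

    # Scores more than one point apart: a third reader adjudicates.
    if third is None:
        raise ValueError("Discrepant scores require a third reading.")

    gap_to_first = abs(third - first)
    gap_to_second = abs(third - second)

    if gap_to_first == 1 and gap_to_second == 1:
        # Third score is exactly between and adjacent to both originals:
        # average all three readings.
        return (first + second + third) / 3

    # Otherwise, average the third score with the closer of the first two.
    closer = first if gap_to_first < gap_to_second else second
    return (third + closer) / 2


print(resolve_score(3, 4))      # adjacent scores        -> 3.5
print(resolve_score(2, 4, 3))   # third exactly between  -> 3.0
print(resolve_score(1, 4, 2))   # third closer to first  -> 1.5
```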
Holistic scorers read the entire essay relatively quickly and give a single score on the basis of their evaluation of the overall quality. One Educational Testing Service test developer likened scoring tests in this way to answering the open-ended evaluative question, “How was your flight?” Many identifiable criteria define the success of a typical commercial airplane flight: the smoothness of takeoff, the expertise of the flight attendant, and the arrival time. Although all these criteria may help determine a passenger's overall evaluation of the flight, some factors (a safe landing) are more crucial than others (the quality of the in-flight movie). Readers should be able to develop a reasonable consensus on the hierarchy of importance of the various scoring criteria and remain aware that extremes in one or more of the individual criteria may end up influencing the overall judgment.
Evolving Technology
Technology is changing the way we assess writing. Traditional paper-and-pencil scoring still flourishes and will most likely persist for years to come, but many large-scale, high-stakes assessments are now being scored both on and by computers.
In 1997, Educational Testing Service began using the Online Scoring Network to score essays for the Graduate Management Admission Test. In this system, test takers' responses are sent electronically to trained scorers, who read the essays online. Before being certified to score the essays, readers are trained in holistic scoring and learn about the specific assessment they will be scoring. After practicing scoring sample responses from previous tests, readers take an online certification test that determines whether they qualify to score the assessment. Once qualified, readers need to successfully score a calibration set of approximately 10 papers before each scheduled scoring session begins; until they are able to demonstrate accurate scoring on these papers, they cannot begin scoring operational papers.
The scoring system includes a fixed percentage of prescored responses, or monitor papers, that demonstrate at a glance whether or not readers are scoring accurately. This immediate access to analyses of readers' scoring performance is a major advantage of the online scoring system.
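In principle, checking a reader against prescored papers is a simple comparison. The sketch below is only a rough Python illustration of that idea; the 90 percent threshold and the exact-or-adjacent rule are assumptions made for the example, not the Online Scoring Network's actual criteria.

```python
# Rough illustration of checking a reader against prescored monitor or
# calibration papers. The agreement rule and the 0.9 threshold are
# assumptions for this sketch, not ETS's actual criteria.

def reader_is_calibrated(reader_scores, established_scores,
                         min_agreement=0.9) -> bool:
    """Return True when enough of the reader's scores exactly match or fall
    within one point of the established scores."""
    assert len(reader_scores) == len(established_scores)
    agreements = sum(
        1 for given, expected in zip(reader_scores, established_scores)
        if abs(given - expected) <= 1
    )
    return agreements / len(established_scores) >= min_agreement


# A calibration set of roughly ten prescored papers:
established = [4, 3, 2, 3, 1, 4, 2, 3, 3, 2]
candidate   = [4, 3, 2, 4, 1, 4, 2, 3, 2, 2]
print(reader_is_calibrated(candidate, established))  # True: every score within one point
```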
Two years after the Online Scoring Network was introduced, the Graduate Management Admission Test began using automated essay scoring software developed by computational linguists at Educational Testing Service. Termed e-rater, this software “trains” on hundreds of human-scored responses at the various score points and determines an essay's score by examining such criteria as development, organization, content-relevant vocabulary, grammar, usage, and style. Currently, all Graduate Management Admission Test essays receive one score from e-rater and one score from a human reader.
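The general idea behind such systems (extract measurable features from an essay, then fit a model to scores that humans have already assigned) can be sketched in a few dozen lines. To be clear, the sketch below is not e-rater, whose features and models are far more sophisticated and proprietary to Educational Testing Service; the crude features and the nearest-neighbor scoring are stand-ins chosen purely for illustration.

```python
# Deliberately simplified illustration of automated essay scoring:
# extract surface features, then predict a score from human-scored examples.
# This is NOT e-rater's feature set or algorithm; it only sketches the
# "train on human-scored responses" approach described above.

import re
from statistics import mean

def extract_features(essay: str) -> list:
    """Crude illustrative features: length, average sentence length,
    and vocabulary variety (type-token ratio)."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    if not words or not sentences:
        return [0.0, 0.0, 0.0]
    return [
        float(len(words)),               # overall development (length)
        len(words) / len(sentences),     # average sentence length
        len(set(words)) / len(words),    # vocabulary variety
    ]

def train(human_scored_essays):
    """'Training' here just stores the features of human-scored essays."""
    return [(extract_features(text), score) for text, score in human_scored_essays]

def predict_score(model, essay: str) -> float:
    """Score a new essay as the average human score of its nearest neighbors
    in feature space (a stand-in for the statistical models real systems use)."""
    target = extract_features(essay)
    def distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, target))
    nearest = sorted(model, key=lambda item: distance(item[0]))[:3]
    return mean(score for _, score in nearest)

# Tiny, artificial example of training and scoring:
model = train([
    ("Dogs are good. I like dogs. Dogs are fun.", 1),
    ("Although dogs demand attention and daily exercise, they repay their "
     "owners with loyalty, companionship, and a reason to spend time outdoors.", 4),
])
print(round(predict_score(model, "Dogs are nice. Dogs play. I like them."), 1))
```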
The ever-increasing role of technology in composition pedagogy and assessment has received a mixed reaction. Some educators have embraced automated essay scoring for its convenience and seeming objectivity. Many in the writing community, however, worry that this kind of scoring may lead to more mechanistic and less creative or thoughtful writing, and therefore caution that these new technologies should be used only to support—never to replace—final human judgment of student writing.
Applications in the Classroom
Assessment specialists talk about washback—the effect that tests have on curriculum and pedagogy. To some degree, teachers will inevitably teach their students the skills on which their students will be assessed. Like it or not, standardized writing assessments are becoming more and more widespread. Although many writing specialists would like to see more use of portfolios for assessment and placement, logistics have made timed, impromptu writing assessment the norm. Teachers can participate proactively in these assessments by bringing into their classrooms their knowledge of the principles of modified holistic scoring.
For example, teachers can help students participate in the assessment process by involving them in the same exercises scorers engage in (Duke & Sanchez, 1994; Fiderer, 1998; Skillings & Ferrell, 2000; White, 1982). In preparation for Pennsylvania's state-mandated tests, students learn to evaluate their own and their peers' writing by following the official rubrics for focus, content, organization, style, and conventions on a four-point scale: Advanced, Proficient, Basic, and Below Basic. For another assessment activity, teachers can distribute three representative high, middle, and low responses to a given assignment—preferably ones written by students in the previous year's class, presented anonymously. Students then rank the responses and articulate what features of each response helped define that ranking.
Automated essay scoring software can enable teachers to assign more writing by having students submit their first drafts for computerized scoring and diagnostic feedback. After a student has revised his or her essay on the basis of that feedback, the piece of writing should be relatively clean and error-free. Teachers can also use the software to conduct student-paced drills in specific elements of the basic conventions.
More and more, teachers are bringing assessment techniques and methodology into their classrooms. Using assessment as an instructional tool means students can generate a scoring rubric of their own—one that does not arrive from some external faceless bureaucracy but that is grounded in students' own values and critical judgment. Such a process fosters critical thinking and formative self-assessment—abilities that will serve students throughout life.
References
• Achieve, Inc. (2004). Do graduation tests measure up? A closer look at state high school exit exams. Washington, DC: Author.
• Boomer, G. (1985). The assessment of writing. In P. J. Evans (Ed.), Directions and misdirections in English evaluation (pp. 63–64). Ottawa, Ontario, Canada: Canadian Council of Teachers of English.
• Chapman, C. (1990). Authentic writing assessment. Washington, DC: American Institutes for Research. (ERIC Document Reproduction Service No. ED 328 606)
• Cooper, C. R., & Odell, L. (1977). Evaluating writing: Describing, measuring, judging. Urbana, IL: National Council of Teachers of English.
• Duke, C. R., & Sanchez, R. (1994). Giving students control over writing assessment. English Journal, 83(4), 47–53.
• Fiderer, A. (1998). Rubrics and checklists to assess reading and writing: Time-saving reproducible forms for meaningful literacy assessment. Bergenfield, NJ: Scholastic.
• Murphy, S., & Ruth, L. (1999). The field-testing of writing prompts reconsidered. In M. M. Williamson & B. A. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 266–302). Cresskill, NJ: Hampton Press.
• Ruth, L., & Murphy, S. (1988). Designing tasks for the assessment of writing. Norwood, NJ: Ablex.
• Skillings, M. J., & Ferrell, R. (2000). Student-generated rubrics: Bringing students into the assessment process. Reading Teacher, 53(6), 452–455.
• White, J. O. (1982). Students learn by doing holistic scoring. English Journal, 50–51.