Ms. Jackson is implementing portfolios in her classroom. Every month or so, she gives her students a new writing assignment. Sometimes the assignment is creative; sometimes it asks students to use information from their science or social studies work; sometimes it involves research in the community. First, the students engage in a variety of pre-writing activities and write drafts. Often, to supplement Ms. Jackson's routine feedback, students convene in small groups for peer review.
Students keep all their writing in a folder, periodically identifying the best pieces for their “showcase” portfolios. Students then take home their portfolios to discuss their progress and favorite pieces with their parents. At the end of the year, Ms. Jackson sends the portfolios to a central scoring site where she and other teachers participate in a statewide scoring effort. The state then plans to make the results public to show how well the schools prepare students in writing.
* * *
What will Ms. Jackson and her students gain from this innovative assessment program? Professional literature and national conference agendas extol the potential benefits of portfolios for teaching, learning, and assessment—particularly compared with traditional multiple-choice tests. Although initial findings favor portfolio assessments, the challenges lie in assuring technical quality, equity, and feasibility for large-scale assessment purposes.
Well-designed portfolios represent important, contextualized learning that requires complex thinking and expressive skills. Traditional tests have been criticized for being insensitive to local curriculum and instruction, and for assessing not only student achievement but also aptitude. Portfolios are being heralded as vehicles that provide a more equitable and sensitive portrait of what students know and are able to do. Portfolios encourage teachers and schools to focus on important student outcomes, provide parents and the community with credible evidence of student achievement, and inform policy and practice at every level of the educational system.
And what of the evidence for these claims? Surprisingly, a dearth of empirical research exists. In fact, of 89 entries on portfolio assessment topics found in the literature over the past 10 years, only seven articles either report technical data or employ accepted research methods. Instead, most articles explain the rationale for portfolio assessment; present ideas and models for how portfolios should be constituted and used; or share details of how portfolios have been implemented in a particular class, school, district, or state. Relatively absent is attention to technical quality, to serious indicators of impact, or to rigorous testing of assumptions.
Technical Quality
Why worry about technical quality? Many portfolio advocates, bridling against the measurement experts who, they believe, have long defined assessment practice and used it to drive curriculum and instruction, do not seem to give much weight to technical characteristics. These advocates accept at face value the belief that performance assessments in general and portfolio assessments in particular are better than traditional multiple-choice tests, are better matched to new theories of curriculum and learning, and are more suitable for the thinking and problem-solving skills that students will need for future success.
Although such concerns for curriculum, instruction, and student learning can and should remain paramount in the design of new assessments, technical quality will continue to be a critical issue if results are used to make important decisions about students, teachers, and schools. We need to know that an assessment provides accurate information for the decisions we wish to make. Are the results of portfolio assessment reliable, consistent, and meaningful estimates of what students know and can do? If not, the assessment will produce fickle results. For students, teachers, schools, districts, or even states, adequate technical quality is essential.
And just what is technical quality? Many of us are familiar with two core elements—reliability and validity. These terms encompass a variety of technical questions and require a variety of evidence (see fig. 1). When available evidence is compared with the evidence needed to certify the soundness of portfolios for making high-stakes decisions about students, we find that much work remains to be done before claims as to the accuracy and usefulness of these assessments can be supported.
Figure 1. A Map of the Technical Territory of Portfolio Assessment
Technical Issue: Reliability
Are the scores consistent or stable? Data needed to assure quality:
• Scorer or rater agreement
• Interrater consistency
• Score stability for the same student on different occasions
• Score stability of papers/entries given different contexts or “portfolio sets” in which a portfolio is scored
• Score consistency across “like” tasks

Technical Issue: Validity
What do the scores mean? Data needed to assure quality:
• What are the scoring criteria based upon (what standards, what definition of “excellence”)?
• How were tasks selected? What view of “achievement” do the portfolio contents present?

Do inferences from the scores lead to accurate decisions about students? programs? schools? Data needed to assure quality:
• Are the scores diagnostic? When students score “high,” what can be said about their performance? What standards are met? What other things do students know or can they do? How do we know?
• When students score “low,” what directions for improving performance do the scores provide? How do we know that the diagnosis will lead to improved performance?
• Is the assessment fair? Are portfolios a disadvantage to any group of students? If so, how?
• What are the consequences of portfolio use on individual students? programs? schools? Do results lead to improvement in students and programs?
Interrater agreement. Raters who judge student performance must agree regarding what scores should be assigned to students' work within the limits of what experts call “measurement error.” Do raters agree on how a portfolio ought to be scored? Do they assign the same, or nearly the same, scores to a particular student's work? If the answer is no, then student scores are a measure of who does the scoring rather than of the quality of the work. Interrater agreement is accepted as the foundation upon which all other decisions about portfolio quality are made. But out of 46 portfolio assessments listed in the CRESST Alternative Assessments in Practice Database, only 13 report data on rater agreement (CRESST 1993).
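To make these two notions concrete, the sketch below uses invented scores from two hypothetical raters (not data from any of the studies cited here) to compute the statistics most often reported for rater agreement: the percentage of exact agreements and the correlation between the raters' scores.

```python
# A minimal sketch of interrater agreement statistics, using invented scores
# from two hypothetical raters on a 1-4 rubric (not data from any cited study).

rater_a = [4, 3, 2, 4, 1, 3, 2, 4, 3, 2]  # rater A's scores on ten portfolios
rater_b = [4, 3, 3, 4, 1, 2, 2, 4, 3, 2]  # rater B's scores on the same portfolios
n = len(rater_a)

# Exact agreement: the proportion of portfolios given identical scores.
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Pearson correlation: do the two raters rank the portfolios similarly?
mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
sd_a = sum((a - mean_a) ** 2 for a in rater_a) ** 0.5
sd_b = sum((b - mean_b) ** 2 for b in rater_b) ** 0.5
correlation = cov / (sd_a * sd_b)

print(f"exact agreement: {exact_agreement:.2f}")     # 0.80 for these invented scores
print(f"interrater correlation: {correlation:.2f}")
```

High agreement by either index indicates that a student's score reflects the quality of the work rather than the identity of the scorer.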
Further, the technical data that exist show uneven results. On one hand, results from Vermont's statewide portfolio assessment program, perhaps the most visible example in the country, have been disappointing. Here, 4th and 8th graders kept portfolios in both writing and mathematics. Writing portfolios contained six to eight pieces representing various writing genres, with one designated as a best piece. Mathematics portfolios contained five to seven papers, each selected as a best piece of one of three types of work: puzzles, math applications, and investigations. Although scoring criteria and procedures for the two types of portfolios differed, both used analytic scoring that rated students' work on a number of different dimensions.
In both subjects, the students' classroom teachers were the first to rate the work. Based on the first year of full implementation, Koretz, Stecher, and Deibert (1993) report interrater reliability of .28 to .60, depending on how the scores were aggregated. This level of agreement was not sufficient to permit reporting many of the aggregate statistics the state had planned to use: The state could not accurately report the proportion of students who achieved each point on the dimensions on which student work was scored, and it could not provide accurate data on the comparative performance of districts and schools.
In Pittsburgh, however, the districtwide assessment obtained high interrater agreement for its writing portfolios (LeMahieu et al. 1993). The Pittsburgh portfolio system grew out of ARTS PROPEL, a five-year project funded by the Rockefeller Foundation to design instruction-based assessment in visual arts, music, and imaginative writing (Camp 1993). Pittsburgh students in grades 5–12 developed their portfolios over a year, a process that required them to compose, revise, and reflect upon their writing. The reflection component was especially extensive and included student comments about the processes they used, their choices and writing purposes, the criteria they used in assessing their writing, and their focus for future work.
Each portfolio contained at least six selections that met such general categories as “a satisfying piece,” “an unsatisfying piece,” “an important piece,” “a free pick,” and so on. Students included drafts and reflections with their finished work. District raters—including both ARTS PROPEL and other teachers—were free to select any evidence in a student's portfolio. The work was rated on three dimensions: accomplishment in writing, use of process and resources, and growth and engagement. Despite the amount of latitude that raters had in selecting pieces to rate and the broad scope of the scoring criteria, interrater agreement correlations ranged from .60 to .70, and the generalizability estimate of interrater agreement when two raters reviewed each piece was in the .80 range.
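LeMahieu and colleagues do not detail the generalizability analysis behind that figure, so the following is only an illustrative sketch: under the classical Spearman-Brown assumption, averaging two raters whose single-rater correlations run .60 to .70 would be projected to yield composite reliabilities near .80.

```python
# Illustrative only: a Spearman-Brown projection of how averaging two raters can
# lift single-rater correlations of .60-.70 into the .80 range. This is an
# assumption-based approximation, not the analysis reported by LeMahieu et al.

def spearman_brown(single_rater_r: float, n_raters: int = 2) -> float:
    """Projected reliability of the average of n_raters parallel ratings."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)

for r in (0.60, 0.65, 0.70):
    print(f"single-rater r = {r:.2f} -> two-rater reliability = {spearman_brown(r):.2f}")
# prints 0.75, 0.79, and 0.82, respectively
```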
In a third example (Herman et al. 1993), an elementary school-based study of writing portfolios found similarly high levels of interrater agreement. In this case, student portfolios contained final drafts of writing that had been assigned over the last half of the school year, composed mainly of narratives and, to a lesser extent, summaries and poems. Raters were recruited from outside the school, from a district that has a long and strong history in analytic writing assessment. Drawing on essentially the same dimensions used in their regular district writing assessment, raters gave each portfolio a single, overall quality score. The average correlation between scores given by pairs of raters was .82, and the percentage of absolute agreement across all pairs of raters averaged .98.
These three examples demonstrate the possibility of achieving interrater reliability for classroom-based portfolios in a variety of configurations. While such reliability is probably easiest to achieve when the contents of portfolios are relatively uniform and when experienced scorers use well-honed rubrics (as in Herman et al. 1993), the Pittsburgh example shows that reliability is also possible when contents are loosely structured. However, consensus among raters, as the Vermont case shows, is not easily achieved. Available data suggest that such consensus depends on clearly articulated criteria, effective training, and rubrics that reflect shared experience, common values, and a deep understanding of student performance. In the Pittsburgh case, such consensus evolved over time through close collaboration.
Further, despite the promising results reported in Pittsburgh and in the study by Herman and colleagues (1993), little work has been done to examine other, equally important sources of portfolio score reliability, such as score stability over time, stability across different rater groups or pairs, and the effect of task or “context” (the portfolio set in which a particular portfolio is rated). Without this information, decisions about individual student portfolio scores are limited to the one occasion, the particular raters, and the specific tasks that make up a particular portfolio assessment.
Validity and meaning of scores. Though reliability in all its forms is necessary, it is not sufficient to establish the core element of technical quality: validity. When we assess students, what we really want to know is, do the scores of a portfolio assessment represent some enduring and meaningful capability? Are scores good indicators of what we think we're assessing? Beyond claims that portfolio work “looks like” it captures important learning, validity issues have been only sparsely studied.
One useful approach to determining what portfolio scores mean is to look for patterns of relationships between the results of portfolio assessments and other indicators of student performance. Score meaning is supported when portfolio scores relate strongly to other valued measures of the same capability and show weak or no relationship to measures of different capabilities. Of course, this approach to verifying score meaning assumes we have good “other” measures for comparison. If, for example, we have no good measures of mathematics problem solving, how do we know whether portfolios are good measures of this capability? This is a vexing problem for validity studies—especially because one of the reasons for the popularity of alternative and portfolio assessments has been our waning trust in the value of existing measurement techniques.
Using this general approach, Koretz and colleagues (1993) investigated score relationships between measures of both similar and different capabilities in the Vermont program. The researchers found moderate correlations ranging from .47 to .58 between writing portfolio scores and direct writing assessments. When comparing writing portfolios with an “unlike capability,” in this case multiple-choice mathematics test scores, however, they found essentially the same level of correlation, rather than no or a very weak relationship.
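The logic of this check can be illustrated with a small simulation. The sketch below uses simulated students (not the Vermont data) in which two measures share a common "writing" ability and a third reflects a separate "math" ability; the expected pattern is a strong correlation between the two writing measures and a weak one between writing and math, precisely the pattern the Vermont portfolios failed to show.

```python
# A sketch of convergent/discriminant evidence using simulated data (not the
# Vermont results): measures of the same capability should correlate strongly,
# measures of different capabilities weakly.
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
writing_ability = [random.gauss(0, 1) for _ in range(200)]
math_ability = [random.gauss(0, 1) for _ in range(200)]

# Each observed measure is the underlying ability plus measurement noise.
writing_portfolio = [w + random.gauss(0, 0.7) for w in writing_ability]
writing_prompt = [w + random.gauss(0, 0.7) for w in writing_ability]
math_test = [m + random.gauss(0, 0.7) for m in math_ability]

# Convergent evidence: two measures of writing should correlate strongly.
print("writing portfolio vs. writing prompt:",
      round(pearson(writing_portfolio, writing_prompt), 2))
# Discriminant evidence: a writing measure should correlate weakly with math.
print("writing portfolio vs. math test:",
      round(pearson(writing_portfolio, math_test), 2))
```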
The researchers state: “More grounds for pessimism occurred when mathematics portfolio scores were compared to ... uniform tests in both writing and mathematics” (Koretz et al. 1993, p. 85). Whereas Koretz and colleagues expected the two measures of mathematics to be highly correlated and the math portfolio to be, at most, weakly related to writing scores, in fact the relationships among all three measures were at a similarly low level.
Similarly, Gearhart and others (1993) found virtually no relationship when comparing results from writing portfolios with those from standard writing assessments. In fact, two-thirds of the students who would have been classified as “masters” based on the portfolio assessment score would not have been so classified on the basis of the standard assessment. When portfolios were scored in two different ways—one giving a single score to the collection as a whole and the second as the average of the scores for all the individually scored pieces in the portfolio—correlations between the two sets of scores were moderately high (in the .6 range). Even so, half the students who would have been classified as “masters” on the basis of the single portfolio score would not have been so classified when scores for individual pieces were averaged. Thus, a student classified as a capable writer on the basis of the portfolio would not necessarily do well when given a standard writing prompt. Further, students classified as capable on the basis of an overall quality score were not always so classified when each piece in the portfolio was scored separately.
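A hypothetical tabulation may help clarify what such classification disagreements look like. In the sketch below, both the scores and the mastery cut point are invented for illustration; they are not the Gearhart data.

```python
# Invented scores illustrating how a "master" classification can flip between a
# single holistic portfolio score and the average of individually scored pieces.

# Each tuple: (holistic portfolio score, mean of individually scored pieces), 1-6 scale.
students = [(5, 3.8), (6, 5.2), (4, 4.5), (5, 4.9), (3, 2.7),
            (6, 4.1), (5, 5.5), (2, 2.3), (4, 3.2), (5, 3.6)]

MASTERY_CUT = 4.5  # hypothetical cut score for "master"

masters_holistic = {i for i, (h, _) in enumerate(students) if h >= MASTERY_CUT}
masters_pieces = {i for i, (_, p) in enumerate(students) if p >= MASTERY_CUT}

kept_label = masters_holistic & masters_pieces
lost_label = masters_holistic - masters_pieces

print(f"masters under the holistic score: {len(masters_holistic)}")  # 6
print(f"still masters when pieces are averaged: {len(kept_label)}")  # 3
print(f"reclassified as non-masters: {len(lost_label)}")             # 3
```

Even when two sets of scores correlate moderately well, decisions made against a cut score can diverge for a substantial share of students whose work falls near the boundary.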
Which assessment best represents an enduring capability, a generalizable skill? Does one context overestimate or another underestimate students' skills? These questions are unanswerable with available data, but are important validity issues, particularly for large-scale assessment purposes.
Fairness: Whose Work Is It?
Portfolio advocates firmly believe that classroom tasks are the better indicator of student capability because they are likely to reflect an authentic purpose, perhaps a more stimulating or relevant topic, an opportunity to engage in an extended writing process, and so on. They say that portfolios are more likely to elicit the true capability of most students, not just those motivated to do well on decontextualized, on-demand, one-shot tests.
This argument, however, may have another side. Rather than motivating better performance and providing a more supportive context than traditional tests, portfolios actually may overestimate student performance. Students often get support in planning, drafting, and revising writing that is part of a classroom assignment; in fact, this support is a hallmark of good instruction. But does this additional support from peers, teachers, or others constitute learning for an individual student, or does it simply make an individual student's work look better? This question is not a major issue when portfolios are used for classroom assessment. Teachers, after all, have many indicators of student capability and are aware of the conditions under which work is produced. In large-scale assessment settings where the question is “What can an individual do?” the issue becomes important indeed.
If some students get more help and support than others, are they not given an unfair advantage during the assessment? The study by Gearhart and others (1993) in a small sample of elementary school classes raises this issue. The researchers asked teachers how much structure or prompting they provided individual students, what type of peer or teacher editorial assistance occurred, and what resources and time were available for compiling the portfolios. Results showed differences in the amount of support given to individual students within classrooms as well as differences between classrooms participating in the study. When students have different levels of assistance, how do we assess their work to determine what they actually can do individually? And how do we provide equitable assessment settings?
The popularity of group work and recommendations to include it in portfolios further complicate the “Whose work is it?” question. Webb (1993), for example, found substantial differences in students' performances when judged on the basis of cooperative group work compared with individual work. Not too surprisingly, low-ability students had higher scores on the basis of group work than on individual work. This, in fact, is an important reason for group work—groups may develop better solutions than individuals working alone. What is important to remember, however, is that the group product doesn't necessarily tell us about the capabilities of individuals.
Additional complications arise when class work merges with homework. The amount of help students get from family and friends becomes an additional threat to the validity of interpretations about individual scores. Consider the student whose screenwriter parent embellishes a composition compared with the student who receives no assistance. Will portfolio assessments put the latter student at a disadvantage? And what of the school in a community of highly educated professionals who are involved in their children's schooling? If portfolios are used for school accountability, does this school represent a better environment for educating students than one where students receive substantially less outside help? Differential support issues become important when student work is scored remotely and scores are used to make high-stakes decisions about students or schools.
The equity of portfolio assessment deserves continuing scrutiny. Research to date suggests that patterns of performance on portfolios mirror those on traditional measures in terms of the relative performance levels of disadvantaged or minority groups. LeMahieu and colleagues (1993), for example, in a study of writing portfolios, found that females do better than males and that white students show higher levels of performance than African-American students. Hearne and Schuman (1992) similarly found the same demographic patterns of performance for traditional standardized and portfolio assessments.
Effects of Implementing Portfolios
Based on self-reports from teachers and others, implementing portfolio assessment does appear to have positive effects on instruction. The majority of teachers queried in Vermont, for example, reported that the implementation of mathematics portfolios led them to devote more classroom time to teaching problem-solving strategies, and about half indicated that they spent more time than before helping their students to deal with mathematical patterns and relationships (Koretz et al. 1993). More than two-thirds noted an increase in emphasis on mathematical communication, and approximately half reported engaging students in more small group work than in prior years. Principals also confirmed these changes. The majority of principals interviewed affirmed that Vermont's portfolio assessment program had beneficial effects on curriculum and instruction, citing as examples specific changes in curriculum content and instructional strategies. Additionally, more than half the principals suggested that portfolios had value as an educational intervention to promote change.
Similarly, Aschbacher's action research (1993) suggests that involvement in the development and implementation of alternative assessments influences teachers' instructional practices and their attitudes toward students. Two-thirds of the teachers in her study, who received training and follow-up technical support in assessment development and scoring, reported substantial change in the ways they thought about their own teaching. As one teacher explained:

The portfolios seem to mirror not only the students' work but the teacher's as well. As a result, I have found the need to re-work, re-organize, and re-assess my teaching strategies (p. 22).
The Aschbacher study also noted that using portfolios had influenced teachers' expectations for their students. Two-thirds reported at least some increase in their expectations for students' thinking, problem solving, and overall level of performance. These findings mirror those in Vermont, where more than 80 percent of the teachers surveyed reported changes in their views of their students' mathematical ability on the basis of portfolio work. As Koretz and colleagues (1993) put it, “Although the amount of change reported by most teachers was small, the pervasiveness of change was striking” (p. 23).
Feasibility Issues
While the literature is promising regarding the potential effects of portfolios on curriculum and instruction, it also indicates the substantial time demands and challenges that portfolio use entails. For example, a majority of principals interviewed in Vermont believed that portfolio assessment generally had salutary effects on their schools' curriculum and instruction and on student learning and attitudes, but almost 90 percent of these principals characterized the program as “burdensome,” particularly from the perspective of its demands on teachers (Koretz et al. 1993).
All studies reviewed reported substantial demands on teachers' time (Aschbacher 1993, Koretz et al. 1993, Wolf and Gearhart 1993): time for teachers to learn new assessment practices, to understand what should be included in portfolios and how to help students compile them, to develop portfolio tasks, to discern and apply criteria for assessing students' work, to reflect upon and fine-tune their instructional and assessment practices, and to work out and manage the logistics. The Vermont study, for example, asking about only some of these demands, found that teachers devoted 17 hours a month to choosing portfolio tasks, preparing portfolio lessons, and evaluating the contents of portfolios; and 60 percent of the teachers surveyed at both 4th and 8th grades indicated that they often lacked sufficient time to develop portfolio lessons (Koretz et al. 1993).
The time Vermont teachers spent developing tasks is indicative of another, even more significant challenge—helping teachers to change their teaching practices. Prior to the statewide assessment program, problem solving apparently wasn't a regular part of instruction in Vermont's classrooms. For example, many of the mathematics portfolios from the pilot year were unscorable because they did not contain work that required problem-solving skills. Have teachers been prepared to develop good assessment tasks? Have they been prepared to teach problem solving and to help students develop deep understanding of subject matter? The weight of evidence suggests not (Aschbacher 1991, 1992; Brewer 1991; Myers et al. 1992; Plake et al. 1992).
What is required is a paradigmatic shift not only in assessment, but in how teachers approach teaching. Teachers are being asked to engage students in deeper levels of cognitive involvement, rich content, and disciplinary understandings. They are also being asked to employ different instructional strategies—such as cooperative group work, extended assignments, discussion of portfolios, and students' self-reflection. And they are being asked to take on different instructional roles—monitoring, coaching, and facilitating students' performances. Yet teachers' preparation for making such shifts is meager at best. Without such preparation, the quality of implementation looms as a very large issue. Reflecting on the ARTS PROPEL experience, Roberta Camp (1993) notes:

The portfolio is far more than a procedure for gathering samples of student writing. Portfolio reflection has changed the climate of the classroom and the nature of teacher/student interactions. Reflection has become part of an approach to learning in which instruction and assessment are thoroughly integrated. Assessment is no longer an enterprise that takes place outside the classroom; it is one in which teachers and students are actively engaged on a recurring basis as they articulate and apply criteria to their own and one another's writing.
Many people rightly worry about the costs of implementing large-scale portfolio assessment programs. While the direct and indirect costs of portfolios have received little study, Catterall and Winters (in press) report that the cost ingredients are many: staff training, development of task specifications and prompts, administration and storage of portfolio records, and scoring. For example, how are “one-shot” performances, such as speeches and dramatic presentations, recorded and stored in a portfolio? These costs pale, however, compared with those required to help teachers develop the skills necessary to realize the benefits of portfolio assessment. Needed are opportunities for professional development, ongoing support, technical expertise, and time for teachers to develop, practice, reflect upon, and hone their instructional and assessment expertise.
The Future
What will the future bring for portfolio assessment? Will portfolio use benefit schools and children? Clearly, substantial challenges in technical quality and feasibility remain if portfolios are to be used for large-scale, high-stakes assessment. Evidence suggests that one basic requisite for technical quality—interrater reliability—is achievable. The conditions and costs of achieving it, however, remain an open issue. More important to resolve are the validity of inferences about individual performances and a range of equity issues. Some of these technical issues could probably be solved most easily by closely specifying and highly standardizing portfolio tasks. But in seeking technical rigor, we need to be sure not to lose the appeal of the portfolio concept.
Evidence about the impact of portfolio assessment on curriculum and instruction is weak, but provocative. Most educators believe that the use of portfolios encourages productive changes in curriculum, instruction, and student learning. Although this evidence is based solely on self-report data (with their well-known limitations), teachers and principals seem to think that portfolio assessment has encouraged them to rethink and to change their curriculum and instructional practices.
However, change alone is not enough—the quality of change and the efficacy of the new practices must be subjected to inquiry. For example, small group work alone, if not thoughtfully structured and if students are inadequately prepared for it, probably cannot facilitate students' learning. Similarly, simply giving students extended assignments more often does not assure that they will be taught how to complete such assignments or that they will receive effective feedback to help them hone their performance (Cohen and Ball 1990, Herman 1994).
Thus far, the literature is silent regarding how well new practices are being implemented. We know that student performance judged on the basis of large-scale portfolio assessments tends to be relatively low (Koretz et al. 1993, LeMahieu et al. 1993). We can infer from these findings that we have considerable room for improvement in both instructional practices and the quality of students' accomplishments. We cannot yet expect the relatively recent portfolio assessment projects to have influenced student outcomes, the ultimate indicator of effective practice. But we can cultivate appetites for research to address these and other issues that should guide educational assessment policy.