March 1, 1994
Vol. 51
No. 6

Making Performance Assessment Work: The Road Ahead

Determining what form of assessment is most useful for what purposes and evaluating the quality of the measures are two of the challenges advocates of alternative assessments face.

As the smoke temporarily clears in the controversy over traditional versus alternative forms of assessment, proponents of new assessments are holding the high ground. Their victory is not surprising.
New assessments match conceptions of learning and thinking developed from research (Resnick and Resnick 1992). They provide the means to meet curriculum targets (NCTM 1989) through the inclusion of challenging new content. Further, alternative assessments provide teachers an occasion to rethink their understanding of subject matter and their fundamental instructional goals.
Although opponents of new assessments raise credible questions of cost, feasibility, objectivity, and fairness, for the most part educators are embracing what they believe to be a more useful view of assessment. The support of measurement experts, who have shifted their attention from reliability coefficients to the consequences of the measures used to evaluate student achievement, further bolsters arguments in favor of performance assessment (Linn et al. 1991, Messick 1989, Shepard 1991).

The Challenges Facing Educators

By embracing alternative assessments, however, educators are beginning rather than ending a complex process. Many questions surface right away; among the foremost is determining which forms of assessment are most useful for which educational purposes. Certain decisions are appropriate if assessments are used for high-stakes purposes, where fairness and comparability become especially important. Other decisions would be made when the purpose of the assessment is private and local, for example, when teachers help develop portfolios of students' cumulative work in order to assign grades or to focus classroom instruction.

Even with lower-stakes purposes, teachers will need to learn to distinguish among assessments of different quality and appropriateness for their students. The adage “seen one, seen them all” does not apply to the assessment domain. Teachers will also need to learn to design such assessments (see Herman et al. 1992 for guidelines). Complicating the picture, early findings on the classroom use of performance assessment suggest that changing teachers' fundamental beliefs and instructional practices is much harder than assessment proponents thought (Aschbacher 1993, Borko et al. 1993).
While the classroom is undoubtedly where assessments matter most in the lives of teachers and students, much of the marketing of performance assessments has promoted their use for system monitoring, accountability, and program evaluation. Certificates of mastery required for the high school diploma and inferences about the nation's progress toward the National Education Goals depend more and more upon tests with substantial performance-assessment components.
Raising the stakes for the use of assessments simultaneously raises the stakes for judging the quality of the assessments themselves. This process is being played out in recent discussions about the allocation and evaluation of federal compensatory education funds. Although Title I funds account for only a small proportion of education dollars, at present more than 70 percent of the nation's schools receive them. The reauthorization of the Elementary and Secondary Education Act and the transition to goals-based evaluation heighten the urgency of studying the technical quality of performance assessments.
For example, the Secretary of Education's Advisory Committee on Chapter 1 (1993) recommended that the evaluation of Chapter 1 adopt the same approach advocated by the National Council on Education Standards and Testing (1992): progress toward national content standards should be judged using state-adopted assessment procedures rather than through exclusive dependence upon nationally normed tests of basic skills. Because state assessment programs and commercial test publishers have moved rapidly into the performance-assessment arena, educators will, at a minimum, need to judge the soundness of their claims and the quality of their measures in order to choose among them.

Promoting Equity in Performance Assessments

An obvious question is how to make quality decisions about assessments. An initial set of criteria (developed by Linn et al. 1991) identifies issues that are central to determining the value of these assessments. These criteria, as recently revised (Baker et al. 1993), fall into two areas. Design criteria, to be judged primarily by looking at the assessment tasks and scoring rubrics, include: cognitive complexity, linguistic appropriateness, content quality, content coverage, and meaningfulness. Effects criteria, to be judged primarily by looking at empirical data, include: transfer and generalizability, instructional sensitivity, fairness, systemic consequences, and practicality and costs.
Let's focus here on the criteria with implications for equity in order to explore both the policy and technical complexities of performance-based assessments. Consider the validity criterion of meaningfulness, for instance. Each assessment task needs to be placed in a context so that it holds students' attention and so that students understand its purpose and can make sense of its content. Such tasks contrast with tests that are decontextualized, have no inherent purpose, and presumably do not motivate students to high performance.
Another obvious equity concern is linguistic appropriateness. This validity criterion can be interpreted fairly simply, for instance, to mean that students can comprehend the language used in tasks and possess the language competency to convey their own subject matter understanding in their performance. This criterion could also be addressed in more sophisticated terms; for example, any text used in performance tasks should be reviewed to ensure that its linguistic structures and usage do not hinder the performance of English-as-a-second-language students whose first languages have structures that might confuse them.
Preeminent among the equity criteria is fairness. In the most direct way possible, this criterion addresses the extent to which students' performances are true representations of their competency and have been acquired, judged, and reported in an unbiased way. One implication of equity would be that any student's performance could be compared in a standard way to another student's performance. Yet this common definition of fairness would conflict with application of the meaningfulness and linguistic appropriateness criteria described above. For instance, if fairness means that the assessment administration circumstances are comparable for different students, does fair administration mean that all children receive identical assessment directions? Or does fairness mean that the attempt is made to ensure that all children have a comparable understanding of what is expected of them, even if it requires adapting the directions from child to child? If our goal is to teach “all” children, then how does the fairness criterion play out for children with various learning disabilities? If we believe that standard-task directions are critical, the criterion of fairness in administration may work against fairness in understanding and task motivation for most on-demand assessments.

The Issue of Instructional Sensitivity

The National Council on Education Standards and Testing has explored the issue of fairness and equity in performance assessments through various task forces. In the task force that is studying the desirability of national standards, the equity issue gave rise to a dialogue about delivery standards (NCEST 1992, pp. E-12, 13). Equity was also a crucial concern of the task forces on “system and school capacity indicators” (p. F-17) and “opportunity to learn” (pp. F-17, 18). After much discussion, NCEST members were unable to reach consensus on the desirability of delivery standards or on the meaning of opportunity to learn—and even less so on how to promote equity in school experiences. The clear intent of the NCEST task forces was to ensure that individual children were not “punished” for poor performance because they were enrolled in schools with inadequate programs. Yet, because NCEST members were apprehensive about prescriptiveness and Federal control, they deferred the problem of equitable delivery and opportunity to learn. As a result, the relationship of instruction to performance tasks was left out of the specific national plans.
Clearly, however, some thought must be given to the relationship between assessment and instruction, and the validity criterion of instructional sensitivity. We need to show that instruction affects performance on alternative assessments. The principal reason is, of course, that we do not want to depend on outcome measures that are not in the slightest degree responsive to our instructional efforts. If such were the case, in five years we might be in the position of seeing no improvement on achievement measures and pondering whether the data reflected problems in the instructional system or in measurement.
The relationship of performance assessment and instruction goes two ways. First, we need to be sure that people can be taught content, skills, and strategies that show up on performance measures. We want differences in performance to be attributable in large part to classroom instruction rather than innate qualities such as talent. At a minimum, we want some evidence that students' performance on these measures improves with teaching. For example, if a performance assessment asks students to design and build a pen for farm animals, what instruction is necessary to prepare students for such a task? Measurement? Planning? Woodcutting? If tasks ask students to complete many steps and use different kinds of knowledge, which ones are critical for instruction? We certainly don't want to repeat our failings with certain standardized tests and lead some people to believe that the only way to assure the accomplishment of a performance task is to teach that task directly. This problem has different shapes under various assessment conditions.
Imagine, for instance, a portfolio of a student's progress in writing from the beginning to the end of a school year. Such a portfolio would demonstrate, on the one hand, that the student's writing had grown more sophisticated or focused. It would be natural to infer that this growth took place as a function of teaching, although clearly at different age levels, student development strongly interacts with performance outcomes. If such tasks were used for high-stakes purposes, knowing the extent to which teaching contributed to performance would be important. We might make murky inferences by comparing performances of similar children in other classrooms to determine whether rates of progress were similar. But if portfolios are as individualistic as we think they should be, such comparisons would be difficult.
Another problem in the area of instructional sensitivity is determining the degree of contribution by the teacher (or by collaborating peers) to the student's displayed work. If the results of the portfolio are used to predict, say, the individual's ability to succeed in a specially enriched program, then we would want some information about the contributions of the teacher and student collaborators. Recent work by Webb (1993) has identified some of the issues involved in making individual performance inferences from group performance tasks.
Now consider a set of high-stakes performance tasks in science or history problem solving. If students do not perform well, we need to know whether teachers have instructed them in the content and skills being measured, for it would be unfair to withhold promotion, a diploma, or other incentives because students were denied an opportunity to learn the material.
How could we find out about instructional sensitivity in this case? Some efforts in this regard ask students and teachers to indicate whether they have been taught (or have taught) the topics covered by an examination, for example, electromagnetism or the Civil War (see Burstein 1993). Others ask the respondents whether they have completed exercises similar to the performance tasks—for instance, if they have ever read primary source historical material and written an explanation of it (Baker, in press). Such self-report data are susceptible to errors of various types, including flaws in memory and social desirability, particularly in high-stakes settings. In classes where the textbook provides the core for instruction, inspection of the materials covered or an intensive look at the content of classroom assignments would give a rough estimate of opportunity to learn.
For the most part, our ability to study opportunity to learn in performance-assessment contexts is limited by the novelty of the procedure. A few experiments that have compared performances of students who have and have not been taught the strategies and content measured by performance assessments indicate that some of these assessments are sensitive to instruction (Baker et al., in press). Yet most student respondents report rarely having been challenged by tasks like those the assessments provide. If teachers have not yet internalized instructional approaches consistent with such assessments, we cannot be surprised that performance is generally low, and lower still for some minority students who may have had the least opportunity to learn complex material.

Where Do We Go from Here?

The challenge on the horizon is to assure that within the limits of our knowledge and the time available for implementation, we develop performance assessments that validly portray the quality of students' accomplishments. How shall this be done? There are at least two approaches.
One option is to create a formal certifying group, not unlike other consumer groups, to identify acceptable and unacceptable assessments and assessment approaches. The mandate for such a group could be, at the low end, to identify “unsafe” assessments or, in a more high-minded way, to identify the best assessments available. At one point, such a certification group was proposed by the National Council on Education Standards and Testing (1992). The idea is an old one; for many years the UCLA Center for the Study of Evaluation conducted and published the results of a Test Evaluation Project, which was terminated when it became clear that the proliferation of tests and other assessments would swamp all other research enterprises. The NCEST suggestion was limited to assessments that would be used to measure progress toward the National Education Goals. Important factors in determining the quality of such measures would be validity evidence that they provide high-quality information, considered in conjunction with the uses to which the assessments would be put.
A more bottom-up, hands-on approach is to devise standards for assessment and let users determine the extent to which specific assessments satisfy them. Our team is presently working with teachers and administrators who are translating our criteria into operational questions to be asked and answered at the school level. This type of grass-roots implementation of assessment standards has possible benefits as more schools move toward decentralized management, but the potential for ideological pressure on testing makes me prefer a model that builds in some base of expertise. In whatever way the structures evolve to assist us, whether through guidelines offered for the new Title I or through revision of a more user-friendly version of the Standards for Educational and Psychological Testing (AERA, APA, and NCME 1985), we clearly cannot set aside our own responsibility for understanding and promoting the quality of any measure used to evaluate educational programs.
References

Advisory Committee on Testing in Chapter 1. (May 1993). Reinforcing the Promise, Reforming the Paradigm. Washington, D.C.: U.S. Department of Education.

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1985). Standards for Educational and Psychological Testing. Washington, D.C.: American Psychological Association.

Aschbacher, P. R. (1993). Issues in Innovative Assessment for Classroom Practice: Barriers and Facilitators (CSE Tech. Rep. No. 359). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.

Baker, E. L. (In press). Learning-Based Assessments of History Understanding (Special Issue). Educational Psychologist.

Baker, E. L., R. L. Linn, and J. Abedi. (In press). “Student Understanding of History: The Dimensionality and Generalizability of Performance Assessments.” Journal of Educational Measurement.

Baker, E. L., H. F. O'Neil, Jr., and R. L. Linn. (December 1993). “Policy and Validity Prospects for Performance-Based Assessment.” American Psychologist 48, 12.

Borko, H., M. Flory, and K. Cumbo. (1993). Teachers' Ideas and Practices About Assessment and Instruction (CSE Tech. Rep. No. 366). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.

Burstein, L. (April 1993). “Validating National Curriculum Indicators: A Conceptual Overview of the RAND/CRESST NSF Project.” Paper presented at the annual meeting of the American Educational Research Association, Atlanta, Ga.

Gearhart, M., J. L. Herman, E. L. Baker, and A. K. Whittaker. (1993). Whose Work Is It? (CSE Tech. Rep. No. 363). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.

Herman, J. L., P. R. Aschbacher, and L. Winters. (1992). A Practical Guide to Alternative Assessment. Alexandria, Va.: Association for Supervision and Curriculum Development.

Linn, R. L., E. L. Baker, and S. B. Dunbar. (1991). “Complex, Performance-Based Assessment: Expectations and Validation Criteria.” Educational Researcher 20, 8: 15–21. (ERIC Document Reproduction Service No. EJ 436 999)

Messick, S. (1989). “Validity.” In Educational Measurement, edited by R. L. Linn, 3rd ed., pp. 13–103. New York: Macmillan.

National Council on Education Standards and Testing. (1992). Raising Standards for American Schools. Washington, D.C.: NCEST.

National Council of Teachers of Mathematics. (1989). Curriculum and Evaluation Standards for School Mathematics. Reston, Va.: NCTM.

Resnick, L. B., and D. P. Resnick. (1992). “Assessing the Thinking Curriculum: New Tools for Educational Reform.” In Future Assessments: Changing Views of Aptitude, Achievement, and Instruction, edited by B. R. Gifford and M. C. O'Connor. Boston: Kluwer.

Shepard, L. (1991). “Psychometricians' Beliefs About Learning.” Educational Researcher 20: 2–16.

Webb, N. M. (1993). Collaborative Group Versus Individual Assessment in Mathematics: Group Processes and Outcomes (CSE Tech. Rep. No. 352). Los Angeles: University of California, Center for the Study of Evaluation.

From our issue: The Challenge of Outcome-Based Education