The time is ripe for a shift to more purposeful educational assessment.
Students in the United States are being educated less well these days than they should be. A key cause for this calamity is that we often use the wrong tests to make our most important educational decisions. What's so sad about this state of affairs is that few people—even educators—actually understand how today's off-target educational tests contribute to diminished educational quality.
When future educational historians look back at the last few decades of U.S. public schooling, they will surely identify a system in which students' scores on annual accountability tests became, almost relentlessly, the prominent determiner of a school's success. But what if the tests currently being used to evaluate our schools are inappropriate? That would mean that many less-skilled teachers whose students score well on those tests aren't receiving the professional support they desperately need. Even worse, because the students of strong teachers might earn low scores on such tests, many of those fine teachers are being urged to abandon effective instructional techniques that appear to be flopping. The net effect is that massive numbers of students might be receiving a lower quality of education than they should.
The moment has come for us to reconsider how we test students in the United States. The time is ripe; here's why.
A Matter of Timing
In real estate, it's often said that the chief determiner of a property's worth is "location, location, location." Similarly, the key condition for wholesale changes in educational testing is "timing, timing, timing." Two recent events, both profoundly important, suggest that if an effort to meaningfully modify the nature of U.S. educational testing were ever to succeed, now is the time to undertake such a shift.
The Consortia-Built Tests
In September 2010, then-U.S. Secretary of Education Arne Duncan announced the awarding of major contracts to two consortia of U.S. states that had agreed to develop "next-generation" educational tests suitable for assessing students' mastery of the skills and knowledge set forth in the Common Core State Standards (CCSS), a set of curricular aims that almost all 50 states had adopted. At a cost of more than $350 million, these consortia-built tests were supposed to permit state-to-state, national comparisons regarding students' mastery of the CCSS curricular targets. Students' scores were to provide estimates of students' college and career readiness. Moreover, one of the two consortia contended that its tests would also serve as catalysts for increased student learning.
Now that both consortia have administered their tests and released examples of their test items, it's clear that no meaningful national comparisons will be possible. Not only are the reporting systems of the two consortia incompatible, but also a good many states have either dropped out of their chosen consortium or adopted alternative state curricular aims.
Nor does it seem that the consortia's tests will stimulate substantial instructional improvements. Many U.S. educators have concluded that the tests' score-reporting systems appear too broad to be of substantial instructional use to in-the-trenches teachers.
Publication of the Joint Standards
A second significant event was the July 2014 publication of the most recent revision of the Standards for Educational and Psychological Testing. These standards represent a set of well-vetted admonitions from the three national organizations most concerned with U.S. educational testing: the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The standards (frequently referred to as the joint standards because of the three organizations that collaboratively sponsor their development) are likely to have a major impact on U.S. educational testing because they play a prominent role in courtroom litigation involving educational tests.
Such cases often involve the use of accountability test scores to deny what the plaintiffs believe are constitutionally guaranteed rights. Given the current use of test scores to evaluate teacher quality, we are apt to get a flock of new cases involving firing and tenure decisions about teachers based, at least in part, on students' scores on achievement tests. The new joint standards will shake up that cage for sure.
For many years, and in several previous editions of the Standards document, assessment validity has been seen as the accuracy with which a test-based inference (or, if you prefer, a test-based interpretation) depicts a test taker's covert capabilities, such as his or her mastery of a cognitive skill or body of knowledge. Indeed, assessment validity has generally been conceded to be the most important concept in all of educational measurement. Yet for a number of years, growing numbers of measurement specialists have clamored for tests to be judged not only by the accuracy of the inferences stemming from test scores (such as the degree to which a student has mastered a set of state curricular goals) but also by the consequences stemming from those inferences (such as whether a low-scoring student should be denied a high school diploma). Happily, the architects of the 2014 joint standards coupled the consequences of test usage with the need to arrive at accurate test-based interpretations regarding what a test taker's score means.
To illustrate, let's consider how the new joint Standards document introduces its chapter on validity:
Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. (p. 11)
The game-changing phrase in this description of validity is "interpretations of test scores for proposed uses." By blending the proposed use of a test with the need to arrive at accurate interpretations about a test taker's performance, the writers of the 2014 standards adroitly resolved how to infuse test usage into the quest for accurate score-based interpretations.
To underscore the impact of the revised joint standards, on September 25, 2015, the U.S. Department of Education released its non-regulatory guidance indicating how state assessment systems must satisfy provisions of the Elementary and Secondary Education Act of 1965 as amended. According to the document, the "Department's guidance reflects the revised Standards for Educational and Psychological Testing" (p. 3). Thus, the joint Standards document has an impact on U.S. educational testing, and the federal government's endorsement of those standards dramatically underscores their significance.
How We Got Here
For almost a full century, the majority of standardized tests in U.S. education have been aimed at providing test users with comparative score interpretations. Such interpretations allow the performance of a test taker who scored at a particular level to be compared with the performances of other test takers, typically in relation to the scores of the previous test takers who constituted the test's norm group.
A major boost to this comparatively oriented testing occurred during World War I, when the Army Alpha test was administered to more than 1,750,000 U.S. Army recruits to identify those most likely to succeed in officer training programs. The Alpha, designed to measure recruits' verbal and quantitative cognitive abilities, was an aptitude test—that is, a test designed to predict test takers' performance in a subsequent setting. The test proved remarkably effective in isolating the most intellectually able recruits through its comparative score interpretation strategy.
Shortly after World War I, a number of educational achievement tests were produced in the United States, such as the Stanford Achievement Tests. Unlike aptitude tests such as the Alpha, these tests were not intended to predict students' future success. Instead, they tested students' mastery of particular content, such as their knowledge and skills in mathematics, language arts, or social studies. Influenced by the Alpha's success, these rapidly emerging nationally standardized achievement tests relied on the same test-development strategy embodied in building the Army Alpha—that is, a comparative score interpretation approach. For almost 100 years, such a comparative approach to educational testing has dominated assessment in U.S. schools.
Do tests sometimes need to provide comparative score interpretations? Of course. In most fixed-quota settings in which there are more applicants than available openings, a comparatively oriented measurement strategy can identify the best (or the worst) among a group of test takers. Although the need to arrive at such comparative score interpretations can constitute a useful purpose of educational testing, it is not educational testing's sole purpose. What we need today is an acknowledgement that validity depends on the purpose for which a test is to be used.
The Three Primary Purposes of Tests
If the most important consideration in appraising an educational test is the degree to which it yields valid inferences related to the test's proposed use, then it seems obvious we should abandon a century's worth of one-approach-fits-all thinking. We should adopt an approach in which the intended purpose of a test plays the unequivocally dominant role. I suggest we refer to a measurement approach wherein tests are built and appraised according to this strategy as purposeful educational assessment.
Instead of a test's purpose being "something to consider," perhaps casually, when constructing and appraising an educational test, purposeful educational assessment requires that the test's primary purpose become the overridingly important factor in the creation and evaluation of the test.
To illustrate, consider the screwdriver. In addition to its recognized screw-tightening and screw-loosening function, we can also use it for a host of other purposes, such as poking holes in a can of auto oil. But the screwdriver's primary purpose is to insert and extract screws.
An educational test also has a primary purpose. Here are the three primary purposes of almost all educational tests:
Comparisons among test takers. One primary purpose of educational testing is to compare students' test performances in order to identify score-based differences among individual students or groups of students. The comparisons often lead to the classification of students' scores on a student-by-student or group-by-group basis.
Improvement of ongoing instruction and learning. A second primary purpose is to elicit evidence regarding students' current levels of learning so educators can make informed decisions regarding changes in their ongoing instruction or in students' current efforts to learn.
Evaluation of instruction. A third primary purpose is to determine the quality of an already-completed set of instructional activities. This is often referred to as summative evaluation.
One of these three primary purposes must always be paramount in any application of purposeful educational assessment. With that in mind, let's briefly revisit each of the three purposes.
Comparisons Among Test Takers
Assessing test takers' performance to allow for score-based comparisons is the measurement mission that has dominated U.S. educational assessment for almost a century. We can make these comparisons on a student-by-student basis—for example, when we calculate students' percentile-based status in relation to that of a norm group. Or we can make them by assigning students to such qualitatively distinct categories as advanced, proficient, basic, or below basic.
When a test's function is chiefly comparative, decisions might be made about individual students, such as grade-to-grade promotions for 3rd graders whose scores on a standardized reading test indicate their readiness for 4th grade reading. In terms of decisions linked to differences among groups, we often see schools labeled as "under-performing" if too few students score "proficient" on state accountability tests. Although the "proficient" cut score might appear to be a fixed criterion rather than a comparison, what's not apparent is that the most important determinant of the cut scores used to distinguish among such performance labels as "proficient" and "advanced" is almost always how large groups of students compare on the test.
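To make these two comparative interpretations concrete, here is a minimal sketch in Python. The norm-group scores and cut scores are invented for illustration only; they are not drawn from any actual test or standard-setting study.

```python
# A minimal sketch (not any test vendor's actual scoring code) of the two
# comparative interpretations described above: percentile rank against a
# norm group, and classification against performance-level cut scores.

from bisect import bisect_right

def percentile_rank(score, norm_group_scores):
    """Percent of norm-group scores falling at or below this score."""
    ranked = sorted(norm_group_scores)
    return 100 * bisect_right(ranked, score) / len(ranked)

def performance_level(score, cut_scores):
    """Map a score to a label using ascending (cut, label) pairs."""
    label = "below basic"
    for cut, name in cut_scores:          # e.g., [(40, "basic"), ...]
        if score >= cut:
            label = name
    return label

# Illustrative numbers only -- real cut scores come from standard-setting panels.
norm_group = [28, 35, 41, 47, 52, 58, 63, 69, 74, 80]
cuts = [(40, "basic"), (60, "proficient"), (75, "advanced")]

print(percentile_rank(63, norm_group))    # 70.0
print(performance_level(63, cuts))        # proficient
```

The point of the sketch is simply that both interpretations derive their meaning from how other test takers performed, not from the instructional content of the items themselves.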
But what's important to recognize about comparison-focused testing is that any subsequent decisions we make about test takers—whether it's denying a diploma, bestowing an award, or satisfying some government requirement—are simply applications of a test whose primary purpose was to compare.
Improvement of Ongoing Instruction and Learning
This second purpose of testing is integral to formative assessment, a process in which teachers collect assessment-elicited evidence during an instructional sequence to enable them to adjust their current instruction or enable students to adjust their current learning tactics. Whether focused on groups of students or individual students, the thrust of such testing is to engender more tailored instruction to better meet student needs.
To illustrate, if a classroom assessment indicates that certain low-scoring students are weak in Subskill X and other low-scoring students are weak in Subskill Z, their teacher can then dish out the appropriate subskill remediation instruction to the students who really need it.
Please note that this second primary purpose of educational testing deals with the improvement of ongoing instruction and learning for the same students, not future instruction and learning involving a different set of students. We can best improve future instruction by using tests whose primary purpose is to evaluate instruction.
Because this second primary purpose is focused on improving instruction and learning, an effective use of such testing would be to supply instructionally diagnostic assessment evidence that would permit teachers to customize their instruction. Diagnostic tests serving this purpose will yield the most useful instructional insights if they contain at least a small number of items measuring students' mastery of an instructionally addressable body of knowledge or cognitive subskill. Although it's usually impossible to totally particularize a set of instructional activities for an entire class of diverse students, even a modest degree of assessment-abetted instructional particularization is surely preferable to none at all.
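As a concrete illustration of such instructionally diagnostic reporting, here is a minimal sketch assuming a hypothetical six-item quiz in which each item is tagged with the single subskill it measures; the item-to-subskill map, the mastery threshold, and the student names are all invented for illustration.

```python
# A minimal sketch, with hypothetical subskill names and item maps, of how
# instructionally diagnostic results might be grouped for remediation.
# Assumption: each item measures one subskill and is scored 0/1; "weak"
# means scoring below a chosen mastery threshold on that subskill's items.

from collections import defaultdict

ITEM_TO_SUBSKILL = {1: "X", 2: "X", 3: "X", 4: "Z", 5: "Z", 6: "Z"}
MASTERY_THRESHOLD = 0.67   # fraction of a subskill's items answered correctly

def subskill_scores(item_scores):
    """item_scores: {item_number: 0 or 1} for one student."""
    totals, correct = defaultdict(int), defaultdict(int)
    for item, score in item_scores.items():
        subskill = ITEM_TO_SUBSKILL[item]
        totals[subskill] += 1
        correct[subskill] += score
    return {s: correct[s] / totals[s] for s in totals}

def remediation_groups(class_results):
    """class_results: {student: {item: score}} -> {subskill: [students]}"""
    groups = defaultdict(list)
    for student, items in class_results.items():
        for subskill, fraction in subskill_scores(items).items():
            if fraction < MASTERY_THRESHOLD:
                groups[subskill].append(student)
    return dict(groups)

class_results = {
    "Ana": {1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1},   # weak in Subskill X
    "Ben": {1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 1},   # weak in Subskill Z
}
print(remediation_groups(class_results))   # {'X': ['Ana'], 'Z': ['Ben']}
```

The design point is the one made above: unless each subskill is measured by at least a handful of items, the per-subskill fractions become too unstable to guide any remediation decisions.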
Evaluation of Instruction
This third primary purpose of educational testing is integral to educators' decisions about whether the instruction they provide for their students is good enough. This testing might focus on a lengthy instructional segment, such as an academic year's worth of reading instruction, or on a lesson that takes up only one or two class periods.
Because tests fulfilling this third primary purpose should help educators tell whether their current instructional efforts are adequate, there's rarely a need for particularization at the level of the individual student. When school district officials or individual teachers are trying to decide whether to alter the instructional program for next year, they'll typically base most of their evaluation-informed decisions on group-aggregated data. For instance, if the teachers and administrators in a particular elementary school are trying to decide whether they wish to adopt a schoolwide emphasis on close reading during the next school year, the evidence to help inform that decision will usually come from all of their students' current scores on reading tests.
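For readers who want to see what "group-aggregated data" might look like in practice, here is a minimal sketch; the class rosters, scores, and benchmark are invented for illustration only.

```python
# A minimal sketch, with invented class rosters and an assumed benchmark, of
# the group-aggregated evidence described above: a school-level summary used
# to inform a program decision rather than student-by-student adjustments.

from statistics import mean

BENCHMARK = 70   # assumed "adequate comprehension" score; not from any real test

def school_summary(scores_by_class):
    """scores_by_class: {class_name: [reading scores]} -> schoolwide summary."""
    all_scores = [s for scores in scores_by_class.values() for s in scores]
    below = sum(1 for s in all_scores if s < BENCHMARK)
    return {
        "school_mean": round(mean(all_scores), 1),
        "percent_below_benchmark": round(100 * below / len(all_scores), 1),
    }

scores_by_class = {
    "Grade 4, Room A": [62, 71, 58, 80, 66],
    "Grade 4, Room B": [75, 69, 83, 90, 64],
}
print(school_summary(scores_by_class))
# {'school_mean': 71.8, 'percent_below_benchmark': 50.0}
```

Nothing in the summary identifies individual students, and that is precisely why such evidence suits program-level evaluation rather than the instructional adjustment described earlier.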
Some background can be helpful here. Throughout history, teachers have naturally wanted to know whether their efforts were effective. Yet not until the passage of the Elementary and Secondary Education Act in 1965 were U.S. educators formally directed by the federal government to collect evidence regarding the effectiveness of their instruction—in particular, the effectiveness of those educational interventions that had been federally supported. State-level recipients of a given year's federal funds supporting Program X were required to supply evaluative evidence that Program X had, indeed, been worth what it had cost the nation's taxpayers. All across the United States, almost overnight, educational evaluation was born.
Because appraising the caliber of an educational intervention is typically centered on determining how much students have learned, it was not surprising to see, in the 1960s, an emerging flock of educational evaluators turning to the tests with which they were most familiar. These novice evaluators had almost always relied on students' test scores from off-the-shelf nationally standardized achievement tests distributed by a half-dozen major educational testing firms. And so it has happened that, for more than a half-century, the evaluation of most U.S. educational interventions, particularly the largest ones, has been fueled by students' scores on nationally standardized achievement tests or, in some instances, on state-developed replicas.
One is struck by the increasingly prevalent use of evidence from such tests to indicate that U.S. schools are "less effective than they ought to be." Interestingly, in almost all of those evaluation-focused applications of a test's results, the tests used haven't been shown to be suitable for this function. More often than not, a comparatively focused test has been employed in an evaluative role—even though the test was not designed for such a mission.
What's at Issue?
Perhaps the best way to determine an educational test's primary purpose is to identify the decision to be made on the basis of a test taker's performance. If we can isolate this decision, then the test's purpose will almost always become apparent. For example,
If a school needs to decide which students should be assigned to a special enrichment course in science, then the purpose of a test to help make that decision would be comparative.
If the decision on the line is how to improve students' mastery of a recently adopted set of curricular aims, then the purpose of a test would be instructional.
If a district's school board members are trying to determine whether an expensive tutorial program is worth its cost, then those board members could make a better decision by using a test whose primary purpose was evaluative.
However, if purposeful educational testing doesn't guide the way we develop and evaluate tests, then this three-category distinction among purposes is flat-out feckless. To have the necessary impact, a purposeful assessment approach must, from its conceptualization, influence every major decision along the way.
To illustrate, if a test is being built to improve ongoing teaching and learning, then it's imperative that the test's builders not attempt to measure students' mastery of too many assessment targets. Trying to measure too many targets makes it impossible to assess mastery of particular targets; the test can't include enough items per target to provide sound estimates of students' mastery of that target. Thus, the test's builders must, from the outset, resist ill-conceived demands to measure too much. Prioritizing proposed assessment targets can help identify a manageable number of targets that can contribute to improved instruction.
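A minimal sketch of that planning check, using assumed numbers (a 40-item test and a floor of five items per target), shows how quickly an over-ambitious list of targets collides with a fixed test length.

```python
# A minimal sketch, using assumed numbers, of the blueprint check implied above:
# given a fixed test length, how many assessment targets can each receive
# enough items to support a defensible per-target mastery estimate?

MIN_ITEMS_PER_TARGET = 5      # assumed floor for a usable mastery estimate
TOTAL_ITEMS = 40              # assumed test length

def max_supportable_targets(total_items, min_items_per_target=MIN_ITEMS_PER_TARGET):
    return total_items // min_items_per_target

def check_blueprint(prioritized_targets, total_items=TOTAL_ITEMS):
    """Keep the highest-priority targets that the test length can support."""
    limit = max_supportable_targets(total_items)
    kept, dropped = prioritized_targets[:limit], prioritized_targets[limit:]
    return kept, dropped

targets = [f"Target {chr(ord('A') + i)}" for i in range(12)]   # 12 proposed targets
kept, dropped = check_blueprint(targets)
print(len(kept), "targets measurable;", len(dropped), "should be deferred")
# -> 8 targets measurable; 4 should be deferred
```

The arithmetic is trivial, but making it explicit at the blueprint stage is what keeps test builders from promising per-target mastery estimates that the test cannot actually deliver.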
A word of caution here: Attention to a test's purpose should not be an innocuous exercise in verbally explicating a test's measurement mission. On the contrary, the primary purpose of a particular educational test—from the very get-go—should dominate the decision making of those who are building the test as well as those who are evaluating it. Currently, emphasis on purpose is absent from U.S. educational testing.
One possible way for educators, and particularly for educational policymakers, to begin deriving dividends from purposeful educational testing is to start demanding what the new joint Standards call for—namely, evidence that a given educational test is, as the Brits say, "fit for purpose."
Author's note: This article is based on the author's Robert L. Linn Distinguished Address Award presentation, Invitation to an Insurrection: A Call for Fundamentally Altering the Way We Build and Evaluate Educational Tests, to be delivered at the American Educational Research Association Annual Meeting, April 8–12, 2016, Washington, DC.
1 American Educational Research Association, American Psychological Association, National Council on Measurement in Education, and Joint Committee on Standards for Educational and Psychological Testing. (2014). Standards for educational and psychological testing. Washington, DC: Author.
2 U.S. Department of Education. (2015, September 25). U.S. Department of Education peer review of state assessment systems: Non-regulatory guidance for states. Washington, DC: U.S. Department of Education, Office of Elementary and Secondary Education.