The results of educational tests are playing such an important role in today's schools that Educational Leadership has decided to tackle the issue head-on. This year, noted assessment authority W. James Popham will discuss key assessment concepts. So for all those alert readers who noticed a change in this column's focus—from accountability to assessment—you noticed right!
According to my dictionary, a grail is, figuratively, something "being earnestly pursued or sought after." Mention the Holy Grail these days, and images of medieval crusaders or a modern-day Indiana Jones come to mind. Grails are definitely good things to seek.
Interestingly, the field of educational measurement has its own Holy Grail. And it's called validity. Indeed, if 100 assessment specialists were asked, "What is the single most important concept in educational measurement?" 100—or perhaps more!—would answer "validity." (Of course, getting more than 100 answers from 100 people would be an instance of measurement error!)
But despite the increasingly important role that assessment now plays in educational accountability and despite the absolute centrality of validity to educational assessment, much misunderstanding exists among educators regarding the nature of assessment validity. One reason for the confusion is that, in nonmeasurement usage, the word simply reeks of goodness. In dictionary talk, for example, the term valid means "sound, just, or well-founded."
But in educational assessment, validity means something different. Assessment validity refers to the accuracy of a score-based inference about a test taker's status. This definition sounds pretty highbrow, but it really isn't. Educators are interested in getting a fix on students' knowledge and skills so they can make sensible instructional decisions about those students. But teachers can't tell how much a particular student knows merely by looking at the student. That's because students' cognitive skills and knowledge are covert. Accordingly, we test students so we can use their overt responses to the test to make an inference about what's covert. Tests aren't valid or invalid; inferences are.
Let's say a 5th grader takes a district reading-skills test and answers 30 of the test's 40 items correctly. (We call that 30-correct a raw score.) What does a raw score of 30 on this test tell us? Well, we might make a criterion-referenced inference about the score (one focused on whether the student meets key learning criteria) by concluding, "This student appears to have mastered about 75 percent of the skills measured by the reading test." Or we could arrive at a norm-referenced inference about the score (one focused on comparing the student's performance with that of others) by asserting, "This student has outperformed 89 percent of the district's 5th graders on the test." Using either interpretive framework, we ascribe meaning to a raw score by making an inference about what the raw score represents.
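To see how the two interpretive frameworks attach different meanings to the same raw score, here is a minimal Python sketch. The function names and the district score distribution are hypothetical, chosen only so the arithmetic mirrors the example above:

```python
# Two ways to ascribe meaning to a raw score -- both are inferences,
# not properties of the test itself.

def criterion_referenced(raw_score, total_items):
    """Estimated percent of the measured skills mastered."""
    return 100 * raw_score / total_items

def norm_referenced(raw_score, peer_scores):
    """Percent of peers the student outperformed."""
    outperformed = sum(1 for s in peer_scores if s < raw_score)
    return 100 * outperformed / len(peer_scores)

raw = 30  # the 5th grader's 30-correct raw score on a 40-item test

print(criterion_referenced(raw, 40))   # 75.0 -> "mastered about 75 percent of the skills"

# Toy district distribution in which 89 of 100 peers scored below 30.
district = [29] * 89 + [31] * 11
print(norm_referenced(raw, district))  # 89.0 -> "outperformed 89 percent of 5th graders"
```

Either way, the number that comes out is an interpretation of the raw score, not a fact about the test itself.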
Why is this distinction between valid tests and valid inferences important? Because if educators start to think in terms of a valid test, they will tend to automatically attribute accuracy to the test itself and, therefore, regard as accurate any inferences based on that valid—that is, accurate—test. But all inferences are interpretive judgments that human beings make about students' test performances. And human beings can sometimes mess up. Educators who routinely regard validity as dealing with judgmental inferences rather than with the products of flaw-free tests will, as a result, make their test-based interpretations with warranted caution.
When it comes to the sorts of evidence needed to confirm the validity of test-based inferences in education, the field relies on a collection of authoritative guidelines known as the Standards for Educational and Psychological Testing, which is revised every decade or so. The most recent revision of the Standards was published in 1999, and the preparation of a new version is just getting under way. But—and here's where the validity soup can start to sour—the kind of evidence approved by the 1999 Standards refers exclusively to inferences about the meaning of an individual test taker's performance. This is, of course, the most important inference we can make about a student's test performance. Yet in this era of educational accountability, we find people making test-based inferences that go well beyond what a student's raw score means.
To illustrate, students' scores on annual accountability tests are typically employed to make what I'll call second-step inferences about teachers' instructional success. A first-step inference focuses on the degree to which students' raw scores accurately reflect their mastery of whatever is being tested. A second-step inference would center on what caused those raw scores to be as low or as high as they are. For example, do low achievement-test scores of economically disadvantaged children signify that those children will never be able to master whatever is being tested? Do high achievement-test scores by students in an affluent suburban school tell us that those students have been appropriately taught? There are no guidelines in the 1999 Standards indicating what sorts of evidence we should assemble to support such second-step inferences. In the absence of supporting evidence, second-step inferences about students' test scores are little more than conjecture.
In recent years, I have been greatly dismayed by the fact that many accountability tests are instructionally insensitive—even those accompanied by ample evidence that they assess students' mastery of a set of curricular aims (a first-step inference). To illustrate, if an accountability test has been constructed in such a way that students' performances are unduly influenced by those students' socioeconomic status, then the test will be instructionally insensitive; it will not allow us to make a valid second-step inference about educators' instructional effectiveness.
Assessment validity, then, is all about inferences. You can learn more about it by spending a few minutes with any introductory educational measurement textbook. Because validity is so central to assessment, and because assessment evidence is so central to educational accountability, it's reasonable to infer that every educator should learn enough about validity to be comfortable with this inference-laced concept.