These days, if a school's standardized test scores are high, people think the school's staff is effective. If a school's standardized test scores are low, they see the school's staff as ineffective. In either case, because educational quality is being measured by the wrong yardstick, those evaluations are apt to be in error.
One of the chief reasons that students' standardized test scores continue to be the most important factor in evaluating a school is deceptively simple. Most educators do not really understand why a standardized test provides a misleading estimate of a school staff's effectiveness. They should.
What's in a Name?
A standardized test is any examination that's administered and scored in a predetermined, standard manner. There are two major kinds of standardized tests: aptitude tests and achievement tests.
Standardized aptitude tests predict how well students are likely to perform in some subsequent educational setting. The most common examples are the SAT-I and the ACT, both of which attempt to forecast how well high school students will perform in college.
But standardized achievement-test scores are what citizens and school board members rely on when they evaluate a school's effectiveness. Nationally, five such tests are in use: California Achievement Tests, Comprehensive Tests of Basic Skills, Iowa Tests of Basic Skills, Metropolitan Achievement Tests, and Stanford Achievement Tests.
A Standardized Test's Assessment Mission
The folks who create standardized achievement tests are terrifically talented. What they are trying to do is to create assessment tools that permit someone to make a valid inference about the knowledge and/or skills that a given student possesses in a particular content area. More precisely, that inference is to be norm-referenced so that a student's relative knowledge and/or skills can be compared with those possessed by a national sample of students of the same age or grade level.
Such relative inferences about a student's status with respect to the mastery of knowledge and/or skills in a particular subject area can be quite informative to parents and educators. For example, think about the parents who discover that their 4th grade child is performing really well in language arts (94th percentile) and mathematics (89th percentile), but rather poorly in science (39th percentile) and social studies (26th percentile). Such information, because it illuminates a child's strengths and weaknesses, can help those parents not only in working with their child's teacher, but also in deciding what at-home assistance to provide. Similarly, if teachers know how their students compare with other students nationwide, they can use this information to devise appropriate classroom instruction.
But there's an enormous range of knowledge and/or skills that children at any grade level might possess. The substantial size of the content domain that a standardized achievement test is supposed to represent poses genuine difficulties for the developers of such tests. If a test actually covered all the knowledge and skills in the domain, it would be far too long.
So standardized achievement tests often need to accomplish their measurement mission with a much smaller collection of test items than might otherwise be employed if testing time were not an issue. The way out of this assessment bind is for standardized achievement tests to sample the knowledge and/or skills in the content domain. Frequently, such tests try to do their assessment job with only 40 to 50 items in a subject field—sometimes fewer.
Accurate Differentiation As a Deity
The task for those developing standardized achievement tests is to create an assessment instrument that, with a handful of items, yields valid norm-referenced interpretations of a student's status regarding a substantial chunk of content. Items that do the best job of discriminating among students are those answered correctly by roughly half the students. Developers avoid items that are answered correctly by too many or by too few students.
As a consequence of carefully sampling content and concentrating on items that discriminate optimally among students, these test creators have produced assessment tools that do a great job of providing relative comparisons of a student's content mastery with that of students nationwide. Assuming that the national norm group is genuinely representative of the nation at large, then educators and parents can make useful inferences about students.
One of the most useful of those inferences typically deals with students' relative strengths and weaknesses across subject areas, such as when parents find that their daughter sparkles in mathematics but sinks in science. It's also possible to identify students' relative strengths and weaknesses within a given subject area if there are enough test items to do so. For instance, if a 45-item standardized test in mathematics allocates 15 items to basic computation, 15 items to geometry, and 15 items to algebra, it might be possible to get a rough idea of a student's relative strengths and weaknesses in those three realms of mathematics. More often than not, however, these tests contain too few items to allow meaningful within-subject comparisons of students' strengths and weaknesses.
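A standard psychometric formula, the Spearman-Brown formula, shows why such few-item subscores are shaky. It relates a test's reliability to its length (the reliability figure in my example is hypothetical, not drawn from any particular test):

\[ r_{\text{short}} = \frac{k\, r_{\text{full}}}{1 + (k - 1)\, r_{\text{full}}} \]

Here, $k$ is the fraction of the original test length retained and $r_{\text{full}}$ is the full-length reliability. If our hypothetical 45-item mathematics test had a reliability of .90, each 15-item section ($k = 1/3$) could be expected to have a reliability of only $(1/3)(.90)\,/\,[1 + (1/3 - 1)(.90)] = .75$, a level at which within-subject comparisons become decidedly shaky.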
A second kind of useful inference that can be based on standardized achievement tests involves a student's growth over time in different subject areas. For example, let's say that a child is given a standardized achievement test every third year. We see that the child's percentile performances in most subjects are relatively similar at each testing, but that the child's percentiles in mathematics appear to drop dramatically at each subsequent testing. That's useful information.
Unfortunately, both parents and educators often ascribe far too much precision and accuracy to students' scores on standardized achievement tests. Several factors might cause scores to flop about. Merely because these test scores are reported in numbers (sometimes even with decimals!) should not incline anyone to attribute unwarranted precision to them. Standardized achievement test scores should be regarded as rough approximations of a student's status with respect to the content domain represented by the test.
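Measurement specialists quantify that roughness with the standard error of measurement. As a purely illustrative example (the numbers below are hypothetical):

\[ SEM = SD\sqrt{1 - r_{xx}} \]

If a test's scores have a standard deviation ($SD$) of 15 points and a reliability ($r_{xx}$) of .91, the standard error of measurement is $15\sqrt{.09} = 4.5$ points. A student's obtained score, in other words, could easily sit several points above or below that student's true level of mastery.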
To sum up, standardized achievement tests do a wonderful job of supplying the evidence needed to make norm-referenced interpretations of students' knowledge and/or skills in relationship to those of students nationally. The educational usefulness of those interpretations is considerable. Given the size of the content domains to be represented and the limited number of items that the test developers have at their disposal, standardized achievement tests are really quite remarkable. They do what they are supposed to do.
But standardized achievement tests should not be used to evaluate the quality of education. That's not what they are supposed to do.
Measuring Temperature with a Tablespoon
For several important reasons, standardized achievement tests should not be used to judge the quality of education. The overarching reason that students' scores on these tests do not provide an accurate index of educational effectiveness is that any inference about educational quality made on the basis of students' standardized achievement test performances is apt to be invalid.
Employing standardized achievement tests to ascertain educational quality is like measuring temperature with a tablespoon. Tablespoons have a different measurement mission than indicating how hot or cold something is. Standardized achievement tests have a different measurement mission than indicating how good or bad a school is. Standardized achievement tests should be used to make the comparative interpretations that they were intended to provide. They should not be used to judge educational quality. Let's look at three significant reasons that it is thoroughly invalid to base inferences about the caliber of education on standardized achievement test scores.
Testing-Teaching Mismatches
The companies that create and sell standardized achievement tests are all owned by large corporations. Like all for-profit businesses, these corporations attempt to produce revenue for their shareholders.
Recognizing the substantial pressure to sell standardized achievement tests, those who market such tests encounter a difficult dilemma that arises from the considerable curricular diversity in the United States. Because different states often choose somewhat different educational objectives (or, to be fashionable, different content standards), the need exists to build standardized achievement tests that are properly aligned with educators' meaningfully different curricular preferences. The problem is exacerbated even further in states where different counties or school districts can exercise more localized curricular decision making.
At a very general level, the goals that educators pursue in different settings are reasonably similar. For instance, you can be sure that all schools will give attention to language arts, mathematics, and so on. But that's at a general level. At the level where it really makes a difference to instruction—in the classroom—there are significant differences in the educational objectives being sought. And that presents a problem to those who must sell standardized achievement tests.
In view of the nation's substantial curricular diversity, test developers are obliged to create a series of one-size-fits-all assessments. But, as most of us know from attempting to wear one-size-fits-all garments, sometimes one size really can't fit all.
The designers of these tests do the best job they can in selecting test items that are likely to measure all of a content area's knowledge and skills that the nation's educators regard as important. But the test developers can't really pull it off. Thus, standardized achievement tests will always contain many items that are not aligned with what's emphasized instructionally in a particular setting.
To illustrate the seriousness of the mismatch that can occur between what's taught locally and what's tested through standardized achievement tests, educators ought to know about an important study at Michigan State University reported in 1983 by Freeman and his colleagues. These researchers selected five nationally standardized achievement tests in mathematics and studied their content for grades 4–6. Then, operating on the very reasonable assumption that what goes on instructionally in classrooms is often influenced by what's contained in the textbooks that children use, they also studied four widely used textbooks for grades 4–6.
Employing rigorous review procedures, the researchers identified the items on the standardized achievement tests that had not received meaningful instructional attention in the textbooks. They concluded that between 50 and 80 percent of what was measured on the tests was not suitably addressed in the textbooks. As the Michigan State researchers put it, "The proportion of topics presented on a standardized test that received more than cursory treatment in each textbook was never higher than 50 percent" (p. 509).
Well, if the content of standardized tests is not satisfactorily addressed in widely used textbooks, isn't it likely that in a particular educational setting, topics will be covered on the test that aren't addressed instructionally in that setting? Unfortunately, because most educators are not genuinely familiar with the ingredients of standardized achievement tests, they often assume that if a standardized achievement test asserts that it is assessing "children's reading comprehension capabilities," then it's likely that the test meshes with the way reading is being taught locally. More often than not, the assumed match between what's tested and what's taught is not warranted.
If you spend much time with the descriptive materials presented in the manuals accompanying standardized achievement tests, you'll find that the descriptors for what's tested are often fairly general. Those descriptors need to be general to make the tests acceptable to a nation of educators whose curricular preferences vary. But such general descriptions of what's tested often permit assumptions of teaching-testing alignments that are way off the mark. And such mismatches, recognized or not, will often lead to spurious conclusions about the effectiveness of education in a given setting if students' scores on standardized achievement tests are used as the indicator of educational effectiveness. And that's the first reason that standardized achievement tests should not be used to determine the effectiveness of a state, a district, a school, or a teacher. There's almost certain to be a significant mismatch between what's taught and what's tested.
A Psychometric Tendency to Eliminate Important Test Items
A second reason that standardized achievement tests should not be used to evaluate educational quality arises directly from the requirement that these tests permit meaningful comparisons among students from only a small collection of items.
A test item that does the best job in spreading out students' total-test scores is a test item that's answered correctly by about half the students. Items that are answered correctly by 40 to 60 percent of the students do a solid job in spreading out the total scores of test-takers.
Items that are answered correctly by very large numbers of students, in contrast, do not make a suitable contribution to spreading out students' test scores. A test item answered correctly by 90 percent of the test-takers is, from the perspective of a test's efficiency in providing comparative interpretations, being answered correctly by too many students.
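The preference for mid-difficulty items rests on simple arithmetic. For an item scored right or wrong, the score variance that the item contributes depends on $p$, the proportion of test-takers answering it correctly:

\[ \sigma^2_{\text{item}} = p(1 - p) \]

That variance peaks at $p = .50$, where it equals .25. For the item answered correctly by 90 percent of test-takers, it falls to $(.90)(.10) = .09$, roughly a third of the maximum spread.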
Test items answered correctly by 80 percent or more of the test takers, therefore, usually don't make it past the final cut when a standardized achievement test is first developed, and such items will most likely be jettisoned when the test is revised. As a result, the vast majority of the items on standardized achievement tests are "middle difficulty" items.
As a consequence of the quest for score variance in a standardized achievement test, items on which students perform well are often excluded. However, items on which students perform well often cover the content that, because of its importance, teachers stress. Thus, the better the job that teachers do in teaching important knowledge and/or skills, the less likely it is that there will be items on a standardized achievement test measuring such knowledge and/or skills. To evaluate teachers' instructional effectiveness by using assessment tools that deliberately avoid important content is fundamentally foolish.
Confounded Causation
The third reason that students' performances on these tests should not be used to evaluate educational quality is the most compelling. Because student performances on standardized achievement tests are heavily influenced by three causative factors, only one of which is linked to instructional quality, asserting that low or high test scores are caused by the quality of instruction is illogical.
To understand this confounded-causation problem clearly, let's look at the kinds of test items that appear on standardized achievement tests. Remember, students' test scores are based on how well students do on the test's items. To get a really solid idea of what's in standardized tests, you need to grub around with the items themselves.
The three illustrative items presented here are mildly massaged versions of actual test items in current standardized achievement tests. I've modified the items' content slightly, without altering the essence of what the items are trying to measure.
The problem of confounded causation involves three factors that contribute to students' scores on standardized achievement tests: (1) what's taught in school, (2) a student's native intellectual ability, and (3) a student's out-of-school learning.
What's taught in school. Some of the items in standardized achievement tests measure the knowledge or skills that students learn in school. In certain subject areas, such as mathematics, children learn in school most of what they know about a subject. Few parents spend much time teaching their children about the intricacies of algebra or how to prove a theorem.
So, if you look over the items in any standardized achievement test, you'll find a fair number similar to the mathematics item presented in Figure 1, which is a mildly modified version of an item appearing in a standardized achievement test intended for 3rd grade children.
Figure 1. A 3rd Grade Standardized Achievement Test Item in Mathematics
Sally had 14 pears. Then she gave away 6. Which of the number sentences below can you use to find out how many pears Sally has left?
This mathematics item would help teachers arrive at a valid inference about 3rd graders' abilities to choose number sentences that coincide with verbal representations of subtraction problems. Or, along with other similar items dealing with addition, multiplication, and division, this item would contribute to a valid inference about a student's ability to choose appropriate number sentences for a variety of basic computation problems presented in verbal form.
If the items in standardized achievement tests measured only what actually had been taught in school, I wouldn't be so negative about using these tests to determine educational quality. As you'll soon see, however, other kinds of items are hiding in standardized achievement tests.
A student's native intellectual ability. I wish I believed that all children were born with identical intellectual abilities, but I don't. Some kids were luckier at gene-pool time. Some children, from birth, will find it easier to mess around with mathematics than will others. Some kids, from birth, will have an easier time with verbal matters than will others. If children came into the world having inherited identical intellectual abilities, teachers' pedagogical problems would be far simpler.
Recent thinking among many leading educators suggests that there are various forms of intelligence, not just one (Gardner, 1994). A child who is born with less aptitude for dealing with quantitative or verbal tasks, therefore, might possess greater "interpersonal" or "intrapersonal" intelligence, but standardized achievement tests do not measure these latter abilities. For the kinds of items that are most commonly found on standardized achievement tests, children differ in their innate abilities to respond correctly. And some items on standardized achievement tests are aimed directly at measuring such intellectual ability.
Consider, for example, the item in Figure 2. This item attempts to measure a child's ability "to figure out" what the right answer is. I don't think that the item measures what's taught in school. The item measures what students come to school with, not what they learn there.
Figure 2. A 6th Grade Standardized Achievement Test Item in Social Studies
If someone really wants to conserve resources, one good way to do so is to:
A. leave lights on even if they are not needed.
B. wash small loads instead of large loads in a clothes-washing machine.
C. write on both sides of a piece of paper.
D. place used newspapers in the garbage.
In Figure 2's social studies item for 6th graders, look carefully at the four answer options. Read each option and see if it might be correct. A "smart" student, I contend, can figure out that choices A, B, and D really would not "conserve resources" all that well; hence choice C is the winning option. Brighter kids will have a better time with this item than their less bright classmates.
But why, you might be thinking, do developers of standardized tests include such items on their tests? The answer is all too simple. These sorts of items, because they tap innate intellectual skills that are not readily modifiable in school, do a wonderful job in spreading out test-takers' scores. The quest for score variance, coupled with the limitation of having few items to use in assessing students, makes such items appealing to those who construct standardized achievement tests.
But items that primarily measure differences in students' inborn intellectual abilities obviously do not contribute to valid inferences about "how well children have been taught." Would we like all children to do well on such "native-smarts" items? Of course we would. But to use such items to arrive at a judgment about educational effectiveness is simply unsound.
Out-of-school learning. The most troubling items on standardized achievement tests assess what students have learned outside of school. Unfortunately, you'll find more of these items on standardized achievement tests than you'd suspect. If children come from advantaged families and stimulus-rich environments, then they are more apt to succeed on standardized achievement test items than will other children whose environments don't mesh as well with what the tests measure. The item in Figure 3 makes clear what's actually being assessed by a number of items on standardized achievement tests.
Figure 3. A 6th Grade Standardized Achievement Test Item in Science
A plant's fruit always contains seeds. Which of the items below is not a fruit?
A. orange
B. pumpkin
C. apple
D. celery
This 6th grade science item first tells students what an attribute of a fruit is (namely, that it contains seeds). Then the student must identify what "is not a fruit" by selecting the option without seeds. As any child who has encountered celery knows, celery is a seed-free plant. The right answer, then, for those who have coped with celery's strings but never its seeds, is clearly choice D.
But what if, when you were a youngster, your folks didn't have the money to buy celery at the store? What if your circumstances simply did not give you the chance to have meaningful interactions with celery stalks by the time you hit the 6th grade? How well do you think you'd do in correctly answering the item in Figure 3? And how well would you do if you didn't know that pumpkins were seed-carrying spheres? Clearly, if children know about pumpkins and celery, they'll do better on this item than will those children who know only about apples and oranges. That's how children's socioeconomic status gets mixed up with children's performances on standardized achievement tests. The higher your family's socioeconomic status is, the more likely you are to do well on a number of the test items you'll encounter in such a test.
Suppose you're a principal of a school in which most students come from genuinely low socioeconomic situations. How are your students likely to perform on standardized achievement tests if a substantial number of the test's items really measure the stimulus-richness of your students' backgrounds? That's right, your students are not likely to earn very high scores. Does that mean your school's teachers are doing a poor instructional job? Of course not.
Conversely, let's imagine you're a principal in an affluent school whose students tend to have upper-class, well-educated parents. Each spring, your students' scores on standardized achievement tests are dazzlingly high. Does this mean your school's teachers are doing a super instructional job? Of course not.
One of the chief reasons that children's socioeconomic status is so highly correlated with standardized test scores is that many items on standardized achievement tests really focus on assessing knowledge and/or skills learned outside of school—knowledge and/or skills more likely to be learned in some socioeconomic settings than in others.
Again, you might ask why on earth would standardized achievement test developers place such items on their tests? As usual, the answer is consistent with the dominant measurement mission of those tests, namely, to spread out students' test scores so that accurate and fine-grained norm-referenced interpretations can be made. Because there is substantial variation in children's socioeconomic situations, items that reflect such variations are efficient in producing among-student variations in test scores.
You've just considered three important factors that can influence students' scores on standardized achievement tests. One of these factors was directly linked to educational quality. But two factors weren't.
What's an Educator to Do?
I've described a situation that, from the perspective of an educator, looks pretty bleak. What, if anything, can be done? I suggest a three-pronged attack on the problem. First, I think that you need to learn more about the viscera of standardized achievement tests. Second, I think that you need to carry out an effective educational campaign so that your educational colleagues, parents of children in school, and educational policymakers understand what the evaluative shortcomings of standardized achievement tests really are. Finally, I think that you need to arrange a more appropriate form of assessment-based evidence.
Learning about standardized achievement tests.
Far too many educators haven't really studied the items on standardized achievement tests since the time that they were, as students, obliged to respond to those items. But the inferences made on the basis of students' test performances rest on nothing more than an aggregated sum of students' item-by-item responses. What educators need to do is to spend some quality time with standardized achievement tests—scrutinizing the test's items one at a time to see what they are really measuring.
Spreading the word.
Most educators, and almost all parents and school board members, think that schools should be rated on the basis of their students' scores on standardized achievement tests. Those people need to be educated. It is the responsibility of all educators to do that educating.
If you do try to explain to the public, to parents, or to policymakers why standardized test scores will probably provide a misleading picture of educational quality, be sure to indicate that you're not running away from the need to be held accountable. No, you must be willing to identify other, more credible evidence of student achievement.
Coming up with other evidence.
If you're going to argue against standardized achievement tests as a source of educational evidence for determining school quality, and you still are willing to be held educationally accountable, then you'll need to ante up some other form of evidence to show the world that you really are doing a good educational job.
I recommend that you attempt to assess students' mastery of genuinely significant cognitive skills, such as their ability to write effective compositions, their ability to use lessons from history to make cogent analyses of current problems, and their ability to solve high-level mathematical problems.
If the selected skills represent really important cognitive outcomes, are seen by parents and policymakers to be genuinely significant, and can be addressed instructionally by competent teachers, then the assembly of a set of pre-test-to-post-test evidence showing substantial student growth in such skills can be truly persuasive.
What teachers need are assessment instruments that measure worthwhile skills or significant bodies of knowledge. Then teachers need to show the world that they can instruct children so that those children make striking pre-instruction to post-instruction progress.
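One conventional way to summarize that progress (offered here as an illustration, not a prescription) is a standardized gain, or effect size:

\[ d = \frac{\bar{X}_{\text{post}} - \bar{X}_{\text{pre}}}{SD_{\text{pre}}} \]

where $\bar{X}_{\text{pre}}$ and $\bar{X}_{\text{post}}$ are students' average scores before and after instruction and $SD_{\text{pre}}$ is the pretest standard deviation. A gain approaching a full standard deviation on a genuinely demanding skill is the sort of evidence that parents and policymakers find difficult to dismiss.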
The fundamental point is this: If educators accept the position that standardized achievement test scores should not be used to measure the quality of schooling, then they must provide other, credible evidence that can be used to ascertain the quality of schooling. Carefully collected, nonpartisan evidence regarding teachers' pre-test-to-post-test promotion of undeniably important skills or knowledge just might do the trick.
Educators should definitely be held accountable. The teaching of a nation's children is too important to be left unmonitored. But to evaluate educational quality by using the wrong assessment instruments is a subversion of good sense. Although educators need to produce valid evidence regarding their effectiveness, standardized achievement tests are the wrong tools for the task.