Are Grades Reliable? Lessons from a Century of Research

Thomas R. Guskey

Few people today question the premise that students' grades should reflect the quality of their work and not whether their teachers are "hard" or "easy" graders. But how much teacher subjectivity is involved in the grading process, and what do we know about its influence?

On the one hand, early studies of grading reliability, some dating back to the 1800s, were clearly motivated by researchers' dissatisfaction with, and sometimes disdain for, teachers' unreliable practices. Our reaction to this, of course, is indignation. On the other hand, the extent of the unreliability in grading identified in these early studies was huge.

Grades for the same work varied dramatically from teacher to teacher, resulting in highly divergent conclusions about students, their learning, and their future studies.

That's not right, either.

So, we dug into 16 individual studies of grading reliability from the early 20th century, plus two early reviews of grading studies, to figure out the most important takeaways.

Early researchers attributed the inconsistency in teachers' grades to one or more of the following: the quality of students' work, the criteria for evaluating the work, teacher severity or leniency, differences in tasks, the grading scale, and teacher error.

The overall finding of these studies—that "grades are unreliable"—has given grades a bad reputation that continues to this day.

Addressing the reliability of grades is a foundational issue. Educators have struggled for over a century to reform grading, with only modest success. But no amount of reform in grading policies or report card formats will improve grading if the grades reported are not reliable. Despite their age, the early research findings about the reliability of grading give us several practical suggestions for today.

1. Clarify Criteria

Our biggest takeaway from the early studies was the expectation that somehow English teachers, mathematics teachers, history teachers—any teachers—would just "know" how to grade a piece of student work. In researcher Daniel Starch's 1913 analysis, one of the most important factors contributing to unreliability in grading was the difference in emphases teachers placed on different criteria.

There are three steps to solving this problem:

clearly define the criteria,
clearly specify the weight or relative emphasis each criterion should contribute to the grade for the piece of student work, and
apply the criteria and weights consistently.
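
The three steps above can be sketched in a few lines of Python. This is only an illustration, not a recommended rubric: the criterion names, the weights (50, 30, and 20), and the 0 to 4 rating scale are all assumptions chosen for the example.

```python
# Step 1: define the criteria. Step 2: fix the weight of each one.
# These names and weights are hypothetical, not taken from any study.
WEIGHTS = {
    "problem_solving": 50,
    "computation": 30,
    "communication": 20,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Step 3: apply the same criteria and weights to every piece of work.

    `ratings` maps each criterion to a rating (assumed here to be 0-4).
    Raises ValueError if a criterion is missing or an extra one is added,
    so a grader cannot quietly change the basis for the grade.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the defined criteria")
    total_weight = sum(weights.values())
    return sum(ratings[name] * weights[name] for name in weights) / total_weight


score = weighted_score(
    {"problem_solving": 3, "computation": 4, "communication": 2}
)
print(score)  # 3.1
```

Because the weights live in one place and the function rejects any other set of criteria, two graders who agree on the individual ratings will arrive at the same overall score.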

Assessment theory has made great strides in the last century, and clarity of criteria is now an important foundation for good assessment. Our work with teachers suggests, however, that identifying and describing clear criteria is one of the most difficult things teachers strive to do. Teachers also must make sure these criteria describe students' learning (the quality of students' performance) and not simply how well they followed directions.

Starch and Edward Elliott, for example, sent mathematics exams to math teachers in 1912 with the request to simply "grade them." Today, math rubrics might contain criteria like problem-solving strategies, computation, and communication, focusing graders on specific aspects of the work. Without such criteria, most graders would approach the problem by asking, "Is it right or wrong?"

Getting Rubric Criteria Right

Rubrics have two parts: criteria, or what students are being asked to do, and performance-level descriptions, or how they did. Here, we look at the desired characteristics for rubric criteria.

Characteristic (the criteria are ...)	Explanation
Appropriate	Each criterion represents an aspect of a standard, curricular goal, or instructional goal or objective that students are intended to learn.
Definable	Each criterion has a clear, agreed-upon meaning that both students and teachers understand.
Observable	Each criterion describes a quality in the performance that can be perceived (seen or heard, usually) by someone other than the person performing.
Distinct from one another	Each criterion identifies a separate aspect of the learning outcomes the performance is intended to assess.
Complete	All the criteria together describe the whole of the learning outcomes the performance is intended to assess.
Arranged along a continuum of quality	Each criterion can be described over a range of performance levels.

Source: Adapted from Brookhart, S.M. (2013). How to create and use rubrics for formative assessment and grading, p. 25. Alexandria, VA: ASCD.

2. Be Consistent

Teachers must be deliberately consistent across all possible factors that could influence grading. There are many recommendations for improving grading consistency, including these by one of us (Susan) and the University of Pittsburgh's Anthony Nitko:

Use a scoring guide, such as a rubric, checklist, or point scheme, to focus decisions on criteria and performance-level requirements. (See sidebar for more information.)
Use a model answer (or two or three to avoid being overly prescriptive) as a reference for your expectations of students. A model answer is not meant to be copied but demonstrates what good work looks like.
For constructed-response assessments, grade all students' responses to one question before moving on to the next question.
Bracket and evaluate separately (whether with a score or feedback) qualities of the work other than those in the criteria (neatness, format, grammar, and mechanics).
Grade work anonymously. For example, cover students' names when grading papers or use digital platforms that report names separately.

3. Use Simple Scales with a Few Distinct Categories

The 0 to 100 scale looks more precise than it really is. Several studies found that the probable error on the 0 to 100 scale is plus or minus five to six points or even greater. In the early 20th century, schools began to shift from reporting percentages to the familiar letter grade scale. It's better to use simpler scales with fewer categories (like A, B, C, D, and F; 4 to 0; or proficiency categories like advanced, proficient, and basic) that don't require fine distinctions that are likely to be inaccurate.
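
A rough way to see what a margin of five points does to the 0 to 100 scale is to treat each reported score as a band rather than a point. The sketch below is a simplification: it assumes a flat margin of 5 (the low end of the range reported above) and ignores the statistical definition of probable error. The function names are ours, for illustration only.

```python
def score_band(score, margin=5):
    """Return the (low, high) range around a reported percentage score."""
    return max(0, score - margin), min(100, score + margin)


def bands_overlap(score_a, score_b, margin=5):
    """True if the two scores cannot be told apart once the margin is applied."""
    low_a, high_a = score_band(score_a, margin)
    low_b, high_b = score_band(score_b, margin)
    return low_a <= high_b and low_b <= high_a


print(score_band(87))         # (82, 92)
print(bands_overlap(87, 92))  # True
print(bands_overlap(70, 85))  # False
```

Under these assumptions, an 87 and a 92 are indistinguishable, even though the five-point gap looks meaningful on a report card.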

Using fewer grade categories means educators contend with fewer borderline cases. If grades are simply Pass/Fail, only scores near the one cut-off are subject to misclassification. With letter grades, there are four cut-offs with potential for classification errors. As the number of categories increases, so do the number of borderline cases and the possibility of misclassification.
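
The effect of adding cut-offs can be counted directly. The sketch below assumes whole-number scores from 0 to 100, hypothetical cut-offs at 60, 70, 80, and 90, and a margin of 5 points on either side of each cut-off. None of these values come from the studies; they are chosen only to make the arithmetic concrete.

```python
def borderline_scores(cutoffs, margin=5):
    """List the whole-number scores (0-100) within `margin` points of any cut-off."""
    return [
        score
        for score in range(0, 101)
        if any(abs(score - cutoff) <= margin for cutoff in cutoffs)
    ]


PASS_FAIL = [60]             # one cut-off (hypothetical passing score)
LETTERS = [60, 70, 80, 90]   # four cut-offs for A, B, C, D, and F (hypothetical)

print(len(borderline_scores(PASS_FAIL)))  # 11
print(len(borderline_scores(LETTERS)))    # 41
```

With one cut-off, 11 of the 101 possible scores sit close enough to the line to be misclassified. With four cut-offs, 41 do.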

In addition, using fewer categories usually results in clearer definitions. The difference between Basic and Proficient, for example, can usually be described in a few phrases, whereas the difference between an 89 and a 90 is difficult to articulate. Fewer categories yield better explanations for why a particular grade describes a specific piece of student work. The generally accepted recommendation for dealing with borderline cases is to look at additional evidence of student learning for the standard or learning goal, such as observations of student work during class.

The current resurgence of the 100-point percentage grade scale has been attributed to the use of computerized grading programs, which typically are developed by software engineers who have scant knowledge of the scale's history of unreliability.

Taking these steps will help educators make grades more reliable, more accurate, more meaningful, and far more defensible.

Thomas R. Guskey, PhD, is professor emeritus in the College of Education, University of Kentucky. A graduate of the University of Chicago, he began his career in education as a middle school teacher and later served as an administrator in Chicago Public Schools. He is a Fellow in the American Educational Research Association and was awarded the Association's prestigious Relating Research to Practice Award.

His most recent books include Implementing Mastery Learning; Get Set, Go! Creating Successful Grading and Reporting Systems; and What We Know About Grading: What Works, What Doesn't, and What's Next.