Through electronic record keeping, today's schools have access to data that give us extraordinary power to improve the teaching process. How can we use such data to their full potential? After years of reflection, Bellmore-Merrick Central High School District in New York developed a plan to use data to identify excellent teachers and improve teacher performance. We based our plan on two key assumptions: that teachers have the greatest impact on student achievement; and that student test results, at least to some extent, provide a valid and reliable reflection of a teacher's performance.
A number of recent studies support the first assumption, which has long been intuitive to educators (Marzano, Pickering, & Pollock, 2001). Wright, Horn, and Sanders (1997) found that "the most important factor affecting student learning is the teacher. In addition, the results show wide variation in effectiveness among teachers. The immediate and clear implication of this finding is that seemingly more can be done to improve education by improving the effectiveness of teachers than by any other single factor" (p. 63).
The second assumption requires that we have some degree of faith in the best of our state assessments. We recognize that New York State's Regents examinations may not be ideal assessments, but we believe that they provide sound measures of academic achievement. They are also the only objective measures that we currently have to assess student learning in typical secondary school subjects, such as U.S. history, English, social studies, science, mathematics, and foreign languages.
On the basis of these assumptions, our district (three high schools and two middle schools) began analyzing end-of-year examination results a number of years ago in an attempt to improve teacher performance. Our system evolved through both successes and failures in analyzing test data—and, most important, through conversations with teachers and administrators about how data analysis can best inform instruction.
The Power of Data Warehousing
Data warehousing and data mining technology have changed a laborious chore—preparing data collection forms mostly by hand—into an exciting problem-solving exercise with the goal of improving instructional performance. The term data warehousing describes the process of storing all demographic and test data in one electronic location, where we can mine them to sift out useful information. The data analyses can be both longitudinal (covering many years for the same students or teachers) and latitudinal (representing many parameters for a given year). Currently, the Nassau County Board of Cooperative Educational Services plans to expand data warehousing services for its 20 member districts, including Bellmore-Merrick.
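To make these two views concrete, here is a minimal sketch of a warehouse as a single table of score records. All names, columns, and values below are invented for illustration; a production warehouse would hold far more fields, but the two slices correspond to the longitudinal and latitudinal analyses just described.

```python
import pandas as pd

# A toy warehouse: one row per student, per exam, per year.
# Every name and number here is invented for illustration.
records = pd.DataFrame({
    "student_id": [101, 101, 102, 102, 103, 103],
    "year":       [2002, 2003, 2002, 2003, 2002, 2003],
    "exam":       ["Grade 8 Math", "Math Regents"] * 3,
    "teacher":    ["Smith", "Jones", "Smith", "Jones", "Smith", "Lee"],
    "score":      [82, 78, 91, 88, 74, 70],
})

# Longitudinal view: the same students' scores across years.
longitudinal = records.pivot_table(index="student_id", columns="year",
                                   values="score")

# Latitudinal view: many parameters (here, per-teacher averages) for one year.
latitudinal = records[records["year"] == 2003].groupby("teacher")["score"].mean()

print(longitudinal)
print(latitudinal)
```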
Figure 1 shows a sample data warehousing cube, which summarizes teacher and student performance on a sample New York State Regents assessment. The cube lists end-of-year results for Teacher X, one of a number of teachers teaching this course. It includes the teacher's end-of-year course grades as well as her students' end-of-year examination grades. On the right side of the chart, in the shaded boxes, are the average grades that Teacher X's students earned on other examinations (longitudinal data) when they were in 8th grade, allowing us to see the relative strength of the students in these classes.
Figure 1. Sample of a Regents Examination Teacher Cube
Down the left side of the chart, we can view latitudinal data, such as similar data for each of Teacher X's classes, which might suggest whether the period of the day had some influence on instruction. We can also see how the classes of other teachers in the same school fared on this test. Farther down, we can compare our results with neighboring schools, districts, and the whole county, always looking at students' longitudinal test data to make sure that we compare results of similar groups of students. It is worth noting that all state exams are group-graded by teachers so that individual teachers' grading practices do not introduce additional variables.
Through the miracle of data mining, we can “drill through” the data to disaggregate any parameter we wish. For instance, we could view the data by student, by gender, by ethnicity, by special education students only, or by any other chosen parameter. What powerful information to put into the hands of schools and teachers!
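As a rough illustration of what a cube like Figure 1 summarizes, the sketch below builds a small teacher-by-period summary with pandas and then "drills through" the same records to disaggregate by gender. All teachers, periods, and scores are hypothetical; the point is only the kind of regrouping the warehouse performs on demand.

```python
import pandas as pd

# Invented end-of-year results for one course taught by two teachers.
results = pd.DataFrame({
    "teacher":    ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "period":     [2, 2, 5, 5, 1, 1, 3, 3],
    "gender":     ["F", "M", "F", "M", "M", "F", "M", "F"],
    "exam_score": [88, 73, 91, 79, 80, 85, 68, 74],
    "grade8_avg": [85, 75, 90, 80, 82, 84, 70, 76],  # longitudinal baseline
})

# Cube-style summary: current exam average and prior baseline
# for each teacher and class period (compare Figure 1).
cube = results.pivot_table(index=["teacher", "period"],
                           values=["exam_score", "grade8_avg"],
                           aggfunc="mean")
print(cube)

# "Drill through": disaggregate the same records by any chosen parameter.
by_gender = results.groupby(["teacher", "gender"])["exam_score"].mean()
print(by_gender)
```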
Although Bellmore-Merrick has had many years of experience in analyzing test results, the introduction of data warehousing and data mining now allows us to electronically store all data in one place and to call forth and reconfigure specific information with minimal effort. The two analyses that we have found most instructive for teachers are teacher-to-teacher comparisons and school-to-school comparisons.
Teacher-to-Teacher Comparisons
If conducted with careful reflection, data analyses for individual teachers, such as the analysis shown in Figure 1, provide both teachers and administrators with a powerful tool. For example, we can compare a teacher's test results with the department's course average, recognizing that each teacher's results are part of that course average.
Also, the more data we review, the more confidently we can draw inferences. For instance, if a particular teacher's students are of average ability yet perform below their classmates for three consecutive years, we can conclude that the teacher's effectiveness is below average, allowing supervisors to offer assistance where it is most needed. Including large numbers of students in the comparison makes our conclusions even more likely to be accurate.
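The statistical reason that large enrollments matter is that the uncertainty of a class average shrinks roughly with the square root of the number of students. The sketch below, using invented scores with the same spread, estimates the standard error of the mean for classes of 20 and 75 students; the smaller class's average is roughly twice as uncertain.

```python
import statistics

def standard_error(scores):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(scores) / len(scores) ** 0.5

# Invented scores with a similar spread but different enrollments.
small_class = [70 + (i * 7) % 25 for i in range(20)]  # n = 20
large_class = [70 + (i * 7) % 25 for i in range(75)]  # n = 75

print(standard_error(small_class))  # wider uncertainty band
print(standard_error(large_class))  # narrower: the average is more trustworthy
```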
Supervisors and administrators must guard, however, against drawing hasty conclusions. Before using data analysis to make judgments about a teacher's effectiveness, it is important to "mine the data" for other factors that might have influenced student performance. The first parameter we review is class enrollment, which tells us how likely it is that any disparities we uncover are due to chance or to factors other than instruction.
For instance, if a teacher has a total enrollment of only 20 students, any conclusions we draw are far more suspect than if that teacher had 75 or more students. Also, multiple sections are much more likely to reveal a true instructional pattern than just one isolated section, which may be the result of scheduling anomalies that bring together a group of atypical students (perhaps above or below average).
Thus, before giving credence to such single-section results, we must firmly establish the nature of the class population—a task that we can accomplish through mining data for previous performance by these students. In one illustrative case, two teachers had markedly different results in Advanced Placement (AP) calculus. One of the two calculus sections was scheduled at the same time as the single section of AP physics. All the AP physics students were therefore scheduled into the other section of calculus. AP physics students are among the best math students in the school, which we confirmed when we reviewed their previous scores on mathematics Regents exams. Therefore, the calculus scores in their section were considerably higher than those in the section without AP physics students. In this case, a scheduling conflict—not a difference in teacher performance—probably caused the disparity in results.
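A check like the one that exposed this scheduling conflict can be expressed as a simple grouped comparison. The sketch below, on invented records, contrasts each calculus section's average on the current exam with its students' average prior Regents mathematics score before trusting any gap between sections.

```python
import pandas as pd

# Invented AP calculus records for two sections, with each student's
# prior mathematics Regents score as an ability baseline.
students = pd.DataFrame({
    "section":        ["A"] * 4 + ["B"] * 4,
    "calculus_score": [95, 92, 97, 94, 81, 85, 78, 83],
    "prior_regents":  [98, 95, 97, 96, 84, 88, 82, 85],
})

summary = students.groupby("section")[["calculus_score", "prior_regents"]].mean()
print(summary)
# If section A also entered with far higher prior scores, the gap in
# calculus results reflects the two populations, not the two teachers.
```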
We have also discovered an interesting anomaly, most apparent when only two teachers share all the students for a given course. One teacher may regularly get a larger percentage of the lower-achieving students because that teacher has gained the reputation with parents and guidance counselors of being supportive of students with special needs. This accommodating teacher typically ends the year with larger class sizes because sympathetic counselors and administrators place lower-achieving students in his or her sections, sometimes without even realizing why they are doing so. The other teacher, who provides less support, tends to lose lower-achieving students by attrition. If data analysis confirms this imbalance in the level of past performance, the teacher with the better results may actually require intensive supervision, especially if his or her more able students only marginally outperformed the supportive teacher's lower-performing students.
Frequently, when we advise a teacher of problematic results, the teacher claims that his or her students were an especially weak group compared with colleagues' students. In such cases, we used to perform an extended analysis to rule out this possibility. Now, with data warehousing available, a once laborious process can occur almost instantaneously.
For instance, if a teacher's students have poor results on the grade 11 U.S. history Regents examination, we can quickly check those students' scores on exams taken previously, including the grade 10 global history Regents examination. If their global history scores had been equal to or better than the school average the previous year, the teacher can no longer argue that the current low U.S. history scores were caused by a lower-achieving group of students. Such longitudinal comparison of student records (same subject, different grade) usually convinces a teacher that something about his or her teaching has caused the poor results—especially if the pattern occurs for two or more years in a row and with large numbers of students. Gently convincing a teacher of the need for altered teaching techniques or strategies is often an essential first step toward improved instruction.
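In warehouse terms, this longitudinal check is a join between the teacher's current roster and last year's results, compared against the school average. The sketch below uses invented tables and column names to show the shape of the query.

```python
import pandas as pd

# Invented prior-year results for the whole school (grade 10 global history).
global_history = pd.DataFrame({
    "student_id":   [1, 2, 3, 4, 5, 6],
    "global_score": [78, 85, 90, 70, 82, 88],
})

# One teacher's current grade 11 U.S. history roster (hypothetical IDs).
roster = pd.DataFrame({"student_id": [2, 3, 6]})

school_avg = global_history["global_score"].mean()
teacher_avg = roster.merge(global_history, on="student_id")["global_score"].mean()

# If this teacher's students matched or beat the school average last year,
# weak U.S. history results cannot be blamed on a weaker group of students.
print(f"school average: {school_avg:.1f}; this teacher's students: {teacher_avg:.1f}")
```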
School-to-School Comparisons
Many districts have only one high school, making school-to-school comparisons difficult without cooperation among districts. Bellmore-Merrick is fortunate in having three high schools and two middle schools, enabling us to make essential school-to-school comparisons that allow us to establish the same expectations for all schools.
Assume for a moment that all teachers in a given department exhibit poor performance. Would a teacher-by-teacher comparison point out teaching weaknesses? Obviously not! Only by making school-to-school comparisons can we create a more global picture.
For example, chemistry classes with a Regents examination passing rate of 65 percent may become the expected norm if all teachers in a school perform at that level year after year. But if two neighboring schools with similar demographics have passing rates of 80 percent, we can uncover likely teaching weaknesses at the first school. The more schools we include in the comparison, the better our chance of discovering poor or, just as important, exemplary teacher performance. In all cases, we must also make longitudinal comparisons, looking at students' previous test performance to make certain that we are comparing results for students who entered the classroom with approximately equal ability.
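A school-to-school comparison reduces to passing rates viewed side by side with an entering-ability baseline. The sketch below, on invented data for three schools, flags the school whose passing rate trails its neighbors despite students of comparable prior performance.

```python
import pandas as pd

# Invented chemistry Regents results for three demographically similar schools.
chem = pd.DataFrame({
    "school":      ["North"] * 4 + ["South"] * 4 + ["East"] * 4,
    "score":       [70, 60, 55, 68, 80, 72, 66, 85, 78, 82, 65, 90],
    "prior_score": [80, 78, 76, 81, 79, 80, 77, 82, 81, 79, 78, 80],
})

summary = chem.groupby("school").agg(
    passing_rate=("score", lambda s: (s >= 65).mean()),  # 65 is the passing mark
    entering_ability=("prior_score", "mean"),
)
print(summary)
# Comparable entering ability but a lower passing rate points toward
# instruction, not the students, as the likely source of the difference.
```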
Using the Comparisons Effectively
Any analysis of teacher performance should be aimed at improving instruction rather than merely sorting or rating teachers or schools. When our district identifies a teacher as needing intensive supervision (after a few years of measured poor performance), the supervising administrator writes an Individual Supervisory Plan aimed at helping the teacher improve. This plan includes regular meetings, analysis of lesson plans and exams, a midyear exam to check teacher progress, observation of colleagues, additional formal and informal classroom observations, and recommendations for appropriate inservice courses.
Recognizing teachers who have exceptionally strong test results is just as important as identifying those with substandard results. And just as we want to recommend revised teaching techniques and strategies for the less successful teacher, we want to identify and then promote to others the effective strategies of the strong teacher. Making these distinctions is especially important as we add item analysis to the services offered by the data warehouse, enabling us to pinpoint the exact test items that the teacher taught effectively or ineffectively. Current analyses of schools and teachers show that virtually all teachers exhibit individual test items where students either underperform or outperform classmates of equal ability. Directing teachers to their strengths and weaknesses, test item by test item, is a powerful outcome of item analysis.
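Item analysis can be sketched as comparing, item by item, one teacher's students against everyone else's. The minimal, hypothetical version below reports percent correct per item for Teacher X versus all other teachers; a real analysis would first match students on prior ability, as described above.

```python
import pandas as pd

# Invented item-level responses: 1 = answered correctly, 0 = incorrectly.
responses = pd.DataFrame({
    "teacher": ["X", "X", "X", "Y", "Y", "Y", "Z", "Z"],
    "item_1":  [1, 1, 1, 1, 0, 1, 1, 0],
    "item_2":  [0, 0, 1, 1, 1, 1, 1, 1],
    "item_3":  [1, 1, 1, 1, 1, 0, 0, 1],
})

# Percent correct per item, Teacher X's students vs. everyone else's.
per_item = responses.groupby(responses["teacher"] == "X").mean(numeric_only=True)
per_item.index = ["other teachers", "teacher X"]
print(per_item)
# Items where teacher X's students lag point to topics taught less
# effectively; items where they lead point to strategies worth sharing.
```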
Lest teachers teach only to the test, we must also take care not to use state exams as the only measure of teacher evaluation. But experience has shown us that teachers whose students do well on state and national tests are typically those who are sought after by parents and students and recognized by other teachers and administrators as exceptional.
In the end, thorough analysis of teacher results plays a vital role in any coordinated plan to improve instruction. To work, such analysis requires the practiced eye of an enlightened supervisor able to account for a variety of variables affecting teacher performance. When we follow this analysis with targeted staff development, our students reap the rewards.