Tuesday, January 25, 2011

Different Tests, Different Answers - Papay (2011) - VAM

Papay, J. P. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193.

Conclusions and Implications
Much more of the variation in teacher value-added estimates arises from the choice of outcome measure than from the model specification.

Instead, Papay's results suggest that test timing and inconsistencies such as measurement error play a much greater role. In particular, the finding that test timing alone may produce substantial variation in teacher productivity estimates across outcome measures raises important questions for teacher accountability policies.

The analyses presented in this research suggest that the correlations between teacher value-added estimates derived from three separate reading tests—the state test, SRI, and SAT—range from 0.15 to 0.58 across a wide range of model specifications.

Although these correlations are moderately high, the assessments produce substantially different answers about individual teacher performance and do not rank teachers consistently. Even using the same test but varying the timing of the baseline and outcome measures introduces a great deal of instability into teacher rankings.
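To see how a moderate correlation can coexist with unstable rankings, here is a small illustration with hypothetical numbers (not Papay's data): two sets of value-added estimates for the same ten teachers, one per outcome test, whose rank correlation falls inside the paper's 0.15–0.58 range yet which largely disagree about the top performers.

```python
# Hypothetical illustration (not Papay's data): two sets of value-added
# estimates for the same ten teachers, one per outcome test. A moderate
# correlation can still reorder who lands in the top group.

def ranks(xs):
    """Return the rank (1 = highest) of each value in xs, assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation (no-ties formula)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

vam_state = [0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.00, -0.05, -0.10, -0.20]
vam_sri   = [0.05, 0.30, -0.10, 0.20, 0.25, -0.05, 0.15, 0.10, 0.00, -0.20]

rho = spearman(vam_state, vam_sri)
top_state = {i for i, r in enumerate(ranks(vam_state)) if r <= 3}
top_sri   = {i for i, r in enumerate(ranks(vam_sri)) if r <= 3}
print(round(rho, 2))        # 0.41 -- a "moderately high" correlation
print(top_state & top_sri)  # {1}  -- only one teacher is "top 3" on both tests
```

With a rank correlation of about 0.41, only one of the three "best" teachers is identified as such by both tests, which is the kind of disagreement a bonus policy would surface.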

Therefore, if a school district were to reward teachers for their performance, it would identify quite different sets of teachers as the best performers depending simply on which reading assessment it used.

Papay's results suggest that test timing also contributes substantially to differences in teacher effectiveness estimates across outcome measures. This is an important finding that merits further study.

If policymakers intend to continue using value-added measures to make high-stakes decisions about teacher performance, more attention should be paid to the tests themselves. Currently, all value-added estimates of teacher effectiveness use tests designed to measure student, not teacher, performance. The ideal properties of tests designed to identify a district’s best teachers may well differ from those designed to assess student proficiency.

Furthermore, the timing of tests must be considered more carefully. For example, the practice of giving high-stakes tests in early spring may not matter much for inferences about student performance in the district—having an assessment of student skills in February may be just as useful as one in May. However, decisions about timing have substantial implications for teacher value-added estimation.

Given the amount of inaccuracy in any single assessment of teacher performance—whether based on test scores or observations—combining multiple sources of information could provide schools and teachers with a better sense of their performance on a wider range of domains.

While multiple measures may provide a more robust assessment of teacher performance and may mitigate the effects of measurement error from any single test, policymakers and district officials must take care in deciding how to combine measures. Douglas (2007) found that using multiple assessments increases evaluation reliability when the measures are highly correlated, but this benefit does not hold up when the measures are less correlated.

Importantly, additional research is needed into the different implications of high- and low-stakes tests for estimating teacher effects. Teachers who appear to perform well using a high-stakes examination but not well with a low-stakes test may be effectively teaching state standards or may be engaged in inappropriate coaching.

All value-added models rely on the assumption that teacher effectiveness can be estimated reliably and validly through student achievement tests.

In practice, the reliability of student achievement growth (the gain score) is lower than the reliability of the individual tests themselves. (jc: e.g., the 5th-grade CST is a different test from the 4th-grade one, so measuring the gain score can be tricky.)
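The classical test theory formula for difference-score reliability makes this point concrete. The sketch below uses textbook numbers, not figures from the paper: two tests that are each quite reliable can still produce a much less reliable gain score when the two tests are highly correlated.

```python
# Textbook illustration (not from the paper): reliability of a gain score
# (post - pre) under classical test theory. Inputs are the reliabilities of
# the pre- and post-tests, their correlation, and their standard deviations.

def gain_reliability(rel_pre, rel_post, r_prepost, sd_pre=1.0, sd_post=1.0):
    """Reliability of the difference score post - pre."""
    num = (sd_pre**2 * rel_pre + sd_post**2 * rel_post
           - 2 * sd_pre * sd_post * r_prepost)
    den = sd_pre**2 + sd_post**2 - 2 * sd_pre * sd_post * r_prepost
    return num / den

# Two tests with reliability 0.90 each, correlated 0.75:
print(round(gain_reliability(0.90, 0.90, 0.75), 2))  # 0.6
```

So a gain built from two 0.90-reliability tests can have a reliability of only 0.60, which is one reason value-added estimates based on growth are noisier than the underlying test scores suggest.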

Additional variation in teacher estimates arises from the nature of testing. Students take tests on different days and at different times of the year. Because students, particularly those in urban schools, have relatively high absenteeism and mobility, the students present to take each test may vary substantially. Thus, teacher value-added estimates may vary across outcomes in part because different samples of students take each test.

As seen in Table 5 (on p. 180), approximately half of the teachers who would earn a $7,000 bonus using the state test would lose money if the district used the SRI instead.

The average teacher in the district would see his or her pay change by $2,178 simply by switching outcome measures. Interestingly, the instability in teacher estimates across outcome measures is much greater for teachers in the middle two quartiles. (p181)

Papay found that differences in test content and scaling do not appear to explain the variation in teacher effects across outcomes in this district. The different samples of students who take each of the tests contribute somewhat, but they do not account for most of the differences. Test timing appears to play a greater role in producing these differences. Nonetheless, it does not explain all of the variation, suggesting that measurement error also contributes to the instability in teacher rankings. (p183)

Papay made comparisons that suggest that summer learning loss (or gain) may produce important differences in teacher effects. Here, the fall-to-fall estimates attribute one summer’s learning loss to the teacher, while the spring-to-spring estimates attribute a different summer’s loss. Thus, the fact that the fall-to-fall and spring-to-spring estimates produce substantially different answers likely reflects, in part, the inclusion of a different summer in each estimate. (p187)
