Papay, J. P. (2011). Different Tests, Different Answers. American Educational Research Journal, 48(1):163-193.
Conclusions and Implications
More variation in teacher value-added estimates arises from the choice of outcome
than from the model specification.
Instead, Papay's results suggest that test timing and inconsistency,
such as measurement error, play a much greater role. In particular,
the finding that the timing of the test alone may produce substantial variation
in teacher productivity estimates across outcome measures raises important
questions for teacher accountability policies.
The analyses presented in this research suggest that the correlations between
teacher value-added estimates derived from three separate reading
tests—the state test, SRI, and SAT—range from 0.15 to 0.58 across a wide
range of model specifications.
Although these correlations are moderately
high, these assessments produce substantially different answers about individual
teacher performance and do not rank individual teachers consistently.
Even using the same test but varying the timing of the baseline and outcome
measure introduces a great deal of instability to teacher rankings.
If a school district were to reward teachers for their performance, it would
identify a quite different set of teachers as the best performers depending
simply on the specific reading assessment used.
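To make the ranking point concrete, here is a minimal sketch (not Papay's analysis) that simulates two noisy test-based estimates of the same underlying teacher effect, then computes their rank correlation and the share of top-quartile teachers who stay in the top quartile when the test changes. The function names, the normal noise model, and the choice of noise_sd=1.0 are all assumptions for illustration; with equal signal and noise variance the expected correlation is about 0.5, which falls inside the 0.15 to 0.58 range reported above.

```python
import random


def ranks(values):
    """Return 1-based ranks; ties are not averaged (fine for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def top_quartile_overlap(x, y):
    """Share of teachers in the top quartile on x who are also top quartile on y."""
    n = len(x)
    k = n // 4
    top_x = set(sorted(range(n), key=lambda i: x[i], reverse=True)[:k])
    top_y = set(sorted(range(n), key=lambda i: y[i], reverse=True)[:k])
    return len(top_x & top_y) / k


def simulate(n_teachers=1000, noise_sd=1.0, seed=0):
    """Hypothetical data: one true effect per teacher, two independently noisy estimates."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    test_a = [t + rng.gauss(0, noise_sd) for t in true_effect]
    test_b = [t + rng.gauss(0, noise_sd) for t in true_effect]
    return test_a, test_b


if __name__ == "__main__":
    a, b = simulate()
    print("rank correlation:", round(spearman(a, b), 2))
    print("top-quartile overlap:", round(top_quartile_overlap(a, b), 2))
```

Even at a correlation near the top of the reported range, a large share of the "best" teachers on one test would not be identified as such on the other, which is the instability the notes describe.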
Papay's results suggest that test timing also contributes substantially
to differences in teacher effectiveness estimates across outcome measures.
This is an important finding that merits further study.
If policymakers intend to continue using value-added measures to make
high-stakes decisions about teacher performance, more attention should be
paid to the tests themselves. Currently, all value-added estimates of teacher
effectiveness use tests designed to measure student, not teacher, performance.
The ideal properties of tests designed to identify a district’s best teachers may
well differ from those designed to assess student proficiency.
The timing of tests must be considered more carefully. For example, the practice of
giving high-stakes tests in early spring may not matter much for inferences
about student performance in the district—having an assessment of student
skills in February may be just as useful as one in May. However, decisions
about timing have substantial implications for teacher value-added estimation.
Given the amount of inaccuracy in any single assessment of teacher
performance—whether based on test scores or observations—combining
multiple sources of information could provide schools and teachers with
a better sense of their performance on a wider range of domains.
While multiple measures may provide a more robust assessment of
teacher performance and may mitigate the effects of measurement error
from using any single test, policymakers and district officials must take
care in deciding how to combine measures. Douglas (2007) found that using
multiple assessments increases evaluation reliability when the measures are
highly related, but this result does not hold consistently for less correlated measures.
Importantly, additional research is needed into the different implications
of high- and low-stakes tests for estimating teacher effects. Teachers who
appear to perform well using a high-stakes examination but not well with
a low-stakes test may be effectively teaching state standards or may be
engaged in inappropriate coaching.
All value-added models rely on the assumption that teacher effectiveness
can be estimated reliably and validly through student achievement tests.
In practice, the reliability of
student achievement growth is lower than that of the individual tests
themselves. (jc: e.g., 5th grade CST test is different from the 4th grade one, so measuring the gain score can be tricky)
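The lower reliability of growth measures can be illustrated with the standard classical-test-theory formula for the reliability of a difference score. This is a textbook result, not something taken from Papay's paper; it assumes measurement errors on the two tests are uncorrelated, and the function and parameter names below are hypothetical.

```python
def gain_score_reliability(rel_pre, rel_post, r_pre_post, sd_pre=1.0, sd_post=1.0):
    """Reliability of the gain (post - pre) under classical test theory.

    rel_pre, rel_post: reliabilities of the two tests
    r_pre_post: observed correlation between pre and post scores
    sd_pre, sd_post: observed standard deviations (default: equal)
    Assumes measurement errors on the two tests are uncorrelated.
    """
    cov = r_pre_post * sd_pre * sd_post
    true_gain_var = rel_pre * sd_pre ** 2 + rel_post * sd_post ** 2 - 2 * cov
    observed_gain_var = sd_pre ** 2 + sd_post ** 2 - 2 * cov
    if observed_gain_var <= 0:
        raise ValueError("observed gain variance must be positive")
    return true_gain_var / observed_gain_var


if __name__ == "__main__":
    # Two tests with reliability 0.9 each, correlated at 0.8
    print(gain_score_reliability(0.9, 0.9, 0.8))
```

With two tests of reliability 0.9 that correlate at 0.8, the gain score's reliability drops to about 0.5: the more strongly the baseline and outcome scores correlate, the less reliable their difference becomes.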
Additional variation in teacher estimates arises from the nature of testing.
Students take tests on different days and at different times of the year.
Because students, particularly those in urban schools, have relatively high
absenteeism and mobility, the students present to take each test may vary
substantially. Thus, teacher value-added estimates may vary across outcomes
in part because different samples of students take each test.
As seen in Table 5 (on p. 180), approximately half of the teachers who would earn
a $7,000 bonus using the state test would lose money if the district used the
The average teacher in the district would see his or her pay changed by
$2,178 simply by switching outcome measures. Interestingly, the instability
in teacher estimates across outcome measures is much greater for teachers
in the middle two quartiles. (p. 181)
Papay found that differences in test content and scaling do not
appear to explain the variation in teacher effects across outcomes in this district.
The different samples of students who take each of the tests contribute
somewhat, but they do not account for most of the differences. Test timing
appears to play a greater role in producing these differences. Nonetheless, it
does not explain all of the variation, suggesting that measurement error also
contributes to the instability in teacher rankings. (p. 183)
Papay made comparisons that suggest
that summer learning loss (or gain) may produce important differences in
teacher effects. Here, the fall-to-fall estimates attribute one summer’s learning
loss to the teacher, while the spring-to-spring estimates attribute a different
summer’s loss. Thus, the fact that the fall-to-fall and spring-to-spring
estimates produce substantially different answers likely reflects, in part,
the inclusion of a different summer in each estimate. (p. 187)
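A small arithmetic sketch of the timing argument, using made-up score changes: a fall-to-fall window covers the teacher's school year plus the summer that follows it, while a spring-to-spring window covers the summer that precedes the school year plus the school year itself. The function names and numbers are hypothetical and only illustrate the bookkeeping.

```python
def fall_to_fall_gain(school_year_gain, following_summer_change):
    """Fall test to next fall test: the teacher's year plus the summer after it."""
    return school_year_gain + following_summer_change


def spring_to_spring_gain(preceding_summer_change, school_year_gain):
    """Prior spring test to this spring test: the summer before plus the teacher's year."""
    return preceding_summer_change + school_year_gain


if __name__ == "__main__":
    # Hypothetical scale-score changes for one classroom
    school_year = 10       # learning during the teacher's school year
    summer_before = -3     # loss over the summer preceding the school year
    summer_after = 1       # change over the summer following the school year

    f2f = fall_to_fall_gain(school_year, summer_after)
    s2s = spring_to_spring_gain(summer_before, school_year)
    # The two windows differ exactly by the difference between the two summers.
    print(f2f, s2s, f2f - s2s)
```

The school-year contribution is identical in both windows, so any gap between the two estimates in this sketch comes entirely from which summer is attributed to the teacher.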