18 February 2019

Is testing good for education?

This post was first published on the Centre for Education Economics website. 
I blogged recently about a new RISE working paper by Annika Bergbauer, Eric Hanushek, and Ludger Woessmann, which finds that:
“standardized external comparisons, both school-based and student-based, is associated with improvements in student achievement.”
William Smith pointed me to his rebuttal blog written with Manos Antoninis, which argues that there are “multiple weaknesses in their analysis that undermine their conclusions”.

This blog is my attempt to make sense of the disagreement. The main issue appears to me to be a misunderstanding by Antoninis & Smith (“AS” from here on) of the mechanism proposed by Bergbauer, Hanushek, and Woessmann (BHW). AS presume that the main mechanism through which testing is hypothesised to improve outcomes is school choice (allowing parents to move their children to schools with better test scores) or punitive government accountability for teachers and schools. But BHW make it clear that their main focus is on the principal-agent relationship between parents as the principal and both students and teachers as their agents. Parents can’t observe the effort made by students and teachers, but standardized testing can provide them with a proxy indicator for effort. This should induce greater effort from both students and teachers. This proposed mechanism has nothing to do with school choice or government accountability.

First, AS argue that
“Our review of the evidence found that evaluative policies promoting school choice exacerbated disparities by further advantaging more privileged children (pp. 49-52).”
This review of the evidence in pp 49-52 of the UNESCO Global Monitoring Report focuses on policies designed to promote school choice. But that is not at all the focus of the BHW analysis, which is on policies that allow for the comparison of schools and students with the purpose of incentivising greater effort. School choice doesn’t need to have anything to do with it. As BHW write:
“That is the focus of this paper: By creating outcome information, student assessments provide a mechanism for developing better incentives to elicit increased effort by teachers and students, thereby ultimately raising student achievement levels to better approximate the desires of the parents”
Second, AS argue that
“punitive systems had unclear achievement effects but troublesome negative consequences, including removing low-performing students from the testing pool and explicit cheating (pp. 52-56).”
As mentioned above, the proposed mechanism in BHW does not at all require a punitive system. BHW write:
“accountability systems that use standardized tests to compare outcomes across schools and students produce greater student outcomes. These systems tend [my emphasis] to have consequential implications and produce higher student achievement than those that simply report the results of standardized tests.”
Having said that, there are some flaws in the literature review cited by AS. This section first cites studies on four individual countries (US, Brazil, Chile, South Korea), without noting that there are significantly positive results from two of them. One of the two papers they cite on Brazil (IDados 2017) concludes that there was “a large, continuous improvement in all those years in both absolute and relative terms when compared to other municipalities in the Northeastern region and in Brazil as a whole” and “it is very likely that [the reform] is at least partially responsible for the changes.” On Chile, a paper not cited because it was published in 2017, just after the review was completed (Murnane et al.), found that “On average, student test scores increased markedly and income-based gaps in those scores declined by one-third in the five years after the passage of [the reform]”.

Next, the review cites two papers (Yi 2015; Gándara and Randall 2015) that present correlational analysis with no attempt to address potential bias from omitted variables or reverse causality. The latter study is based on a small sub-sample of the fuller data used by BHW.

Next, AS take issue with the way that BHW construct their four categories of test usage. For ease of reference, I first reproduce below the four categories, along with the wording of the questions that go into constructing each category.

1. Standardized External Comparison
  • “In your school, are assessments of 15-year-old students used to compare the school to district or national performance?” (PISA)
  • existence of national/central examinations at the lower secondary level (OECD, EAG)
  • National exams (primary) (Eurydice (EACEA))
  • Central exit exams at the end of secondary school (Leschnig, Schwerdt, and Zigova (2017))

2. Standardized Monitoring
  • “Generally, in your school, how often are 15-year-old students assessed using standardized tests?” (PISA)
  • “During the last year, have [tests or assessments of student achievement] been used to monitor the practice of teachers at your school?” (PISA)
  • “In your school, are achievement data … tracked over time by an administrative authority[?]”

3. Internal testing
  • whether assessments are used “to inform parents about their child’s progress.”
  • use of assessments “to monitor the school’s progress from year to year.”
  • “achievement data are posted publicly (e.g. in the media).” (This is vaguely phrased and likely to be understood by school principals to include practices such as posting, on the school’s blackboard, the school mean of the grade point average of a graduating cohort, derived from teacher-defined grades rather than any standardized test.)

4. Internal teaching monitoring
  • whether assessments are used “to make judgements about teachers’ effectiveness.”
  • practice of teachers is monitored through “principal or senior staff observations of lessons.”
  • “observation of classes by inspectors or other persons external to the school” are used to monitor the practice of teachers.

First, AS argue that question 3c should really fall under category 1. The effect of this question on outcomes is mostly statistically insignificant, though for Maths and Science the coefficients in the interacted model have the same signs as the other variables in category 1 (positive in the base model, with a negative coefficient on the interaction with initial score). Would adding this one variable to the four variables already in the category make the results statistically insignificant overall? I think probably not, but I can’t say for sure without looking at the raw data.

Second, AS claim that question 4a should really fall under category 1 or 2. This claim seems debatable. The theoretical mechanism that BHW put forward is that providing credible information to parents induces greater effort from teachers. This use of testing is clearly internal to the school, and could refer to internal school assessments rather than necessarily standardized assessments that allow for external comparison with teachers at other schools.

Third, AS criticise the inclusion of high-stakes student assessments as indicators: by placing the stakes on students rather than schools, these do not relate to government accountability. But government accountability is not what BHW claimed was driving the effect.

Fourth, AS suggest the use of standardized testing of 15-year-olds may effectively be “teaching to the test”. This seems odd to me - they clearly aren’t literally teaching to the test, because PISA is a different test. BHW are looking at the effect of introducing high-stakes national standardized testing on student results in a totally separate, low-stakes, sample-based test (PISA). AS then don’t really address the argument that “teaching to the test” can also be a positive thing if the test is well designed and covers a good sample of the things that students are expected to have learnt.

Finally, AS focus only on those results that are statistically significant in the baseline model (which estimates the average effect across all countries). However, they miss a really important conclusion of the paper, which concerns heterogeneity: the effects of testing are largest for the weakest-performing systems. This is clear in Figure 3.

Looking at the interacted model (Table A5), both of the other two questions in category 2 (2b and 2c) are statistically significant.
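To see why a positive base coefficient combined with a negative coefficient on the interaction with initial score implies larger gains for weaker systems, here is a minimal sketch in Python. The coefficient values (`beta`, `gamma`) and the example scores are made up purely for illustration - they are not BHW’s estimates:

```python
# Toy illustration of an interacted model (hypothetical numbers, not BHW's):
# the marginal effect of introducing standardized comparison is
#   effect = beta + gamma * initial_score
# with beta > 0 and gamma < 0, so systems with low initial achievement
# gain the most, and the effect shrinks as initial achievement rises.

def marginal_effect(initial_score, beta=2.0, gamma=-0.003):
    """Marginal effect of the testing reform at a given initial score."""
    return beta + gamma * initial_score

weak_system = marginal_effect(400)    # low initial achievement
strong_system = marginal_effect(550)  # high initial achievement

print(weak_system, strong_system)  # the weaker system shows the larger effect
```

With these hypothetical coefficients, a system starting from 400 points gains considerably more from introducing standardized comparisons than one starting from 550 - the pattern shown in BHW’s Figure 3.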

To sum up, there are weaknesses in AS’s interpretation of BHW that undermine their criticism. BHW focus on the role that testing can play in increasing the effort of students and teachers, with or without government accountability systems. In addition, the review of government accountability systems presented in the UNESCO Global Monitoring Report has its own weaknesses and paints an unduly negative picture. My prior remains that standardized testing plays a positive role, particularly in weak systems.

Thanks to Gabriel Heller-Sahlgren, William Smith, and Manos Antoninis for comments on a draft of this post. This acknowledgement clearly does not imply that Smith and Antoninis agree with this post - they don’t!
