Showing posts with label evaluation. Show all posts

09 March 2025

The key to better education systems is accountability. So how on earth do we do that?

And what do we even actually mean when we talk about accountability?

Perhaps the key theme emerging from research on reforming education systems is accountability. But accountability means different things to different people. Many think first of bottom-up ('citizen' or 'social') accountability, but in development economics enthusiasm for bottom-up social accountability is waning as studies show limited impacts on outcomes. The implicit conclusion is then to revisit top-down (state) accountability. As Rachel Glennerster (Executive Director of J-PAL) wrote recently:
"For years the Bank and other international agencies have sought to give the poor a voice in health, education, and infrastructure decisions through channels unrelated to politics. They have set up school committees, clinic committees, water and sanitation committees on which sit members of the local community. These members are then asked to “oversee” the work of teachers, health workers, and others. But a body of research suggests that this approach has produced disappointing results."
One striking example of this kind of research is Ben Olken’s work on infrastructure in Indonesia, which directly compared the effect of a top-down audit (which was effective) with bottom-up community monitoring (ineffective).

So what do we mean by top-down accountability for schools?

Within top-down accountability there are a range of methods by which schools and teachers could be held accountable for their performance. Three broad types stand out:

  • Student test scores (whether simple averages or more sophisticated value-added models)
  • Professional judgement (e.g. based on lesson observations)
  • Student feedback
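The "value-added" idea in the first bullet can be made concrete with a toy sketch: predict each student's end-of-year score from their start-of-year score, then average the prediction errors by teacher. All the numbers below are invented, purely for illustration.

```python
from collections import defaultdict

# Invented data: (prior_score, final_score, teacher)
data = [
    (50, 58, "A"), (60, 66, "A"), (70, 75, "A"),
    (50, 52, "B"), (60, 61, "B"), (70, 69, "B"),
]

# Fit a simple least-squares line of final score on prior score, pooling all students.
n = len(data)
mean_prior = sum(p for p, _, _ in data) / n
mean_final = sum(f for _, f, _ in data) / n
slope = sum((p - mean_prior) * (f - mean_final) for p, f, _ in data) \
    / sum((p - mean_prior) ** 2 for p, _, _ in data)
intercept = mean_final - slope * mean_prior

# A teacher's "value added" is the average surprise among their students:
# how much better (or worse) they did than their prior score predicted.
residuals = defaultdict(list)
for prior, final, teacher in data:
    residuals[teacher].append(final - (intercept + slope * prior))
value_added = {t: sum(r) / len(r) for t, r in residuals.items()}

print(value_added)  # teacher A above expectation, teacher B below
```

Real value-added models control for much more (student demographics, peer effects, measurement error), but the core logic is just this: compare outcomes against a prediction, and attribute the residual to the teacher or school.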
The Gates Foundation published a major report in 2013 on how to “Measure Effective Teaching”, concluding that each of these three types of measurement has strengths and weaknesses, and that the best teacher evaluation system should therefore combine all three: test scores, lesson observations, and student feedback.

By contrast, when it comes to holding head teachers accountable for school performance, the focus in both US policy reform and research is almost entirely on test scores. There are good reasons for this - education in the US has developed as a fundamentally local activity built on bottom up accountability, often with small and relatively autonomous school districts, with little tradition of supervision by higher levels of government. Nevertheless, as Helen Ladd, a Professor of Public Policy and Economics at Duke University and an expert in school accountability, wrote on the Brookings blog last year:
"The current test based approach to accountability is far too narrow … has led to many unintended and negative consequences. It has narrowed the curriculum, induced schools and teachers to focus on what is being tested, led to teaching to the test, induced schools to manipulate the testing pool, and in some well-publicized cases induced some school teachers and administrators to cheat
Now is the time to experiment with inspections for school accountability … 
Such systems have been used extensively in other countries … provide useful information to schools … disseminate information on best practices … draw attention to school activities that have the potential to generate a broader range of educational outcomes than just performance on test scores … [and] treats schools fairly by holding them accountable only for the practices under their control … 
The few studies that have focused on the single narrow measure of student test scores have found small positive effects."
A report by the US think tank “Education Sector” also highlights the value of feedback provided through inspection systems to schools.
"Like many of its American counterparts, Peterhouse Primary School in Norfolk County, England, received some bad news early in 2010. Peterhouse had failed to pass muster under its government’s school accountability scheme, and it would need to take special measures to improve. But that is where the similarity ended. As Peterhouse’s leaders worked to develop an action plan for improving, they benefited from a resource few, if any, American schools enjoy. Bundled right along with the school’s accountability rating came a 14-page narrative report on the school’s specific strengths and weaknesses in key areas, such as leadership and classroom teaching, along with a list of top-priority recommendations for tackling problems. With the report in hand, Peterhouse improved rapidly, taking only 14 months to boost its rating substantially."
In the UK, ‘Ofsted’ reports are based on a composite of several different dimensions, including test scores, but also as importantly, independent assessments of school leadership, teaching practices and support for vulnerable students.

There is a huge lack of evidence on school accountability

This blind spot on school inspections isn’t just a problem for education in the US, though. The US is also home to most of the leading researchers on education in developing countries, and that research agenda is skewed by the US policy and research context. The leading education economists don’t study inspections because there aren’t any in the places they live.

The best literature reviews in economics can often be found in the "Handbook of Economics" series and the Journal of Economic Perspectives (JEP). The Handbook article on "School Accountability" from 2011 exclusively discusses the kind of test-based accountability that is common in the US, with no mention at all of the kind of inspections common in Europe and elsewhere. A recent JEP symposium on Schools and Accountability includes a great article by Isaac Mbiti, a Research on Improving Systems of Education (RISE) researcher, on "The Need for Accountability in Education in Developing Countries", which, however, includes only one paragraph on school inspections. Another great resource on this topic is the 2011 World Bank book, "Making Schools Work: New Evidence on Accountability Reforms". This 'must-read' 250-page book has only two paragraphs on school inspections.

This is in part a disciplinary point - it is mostly a blind-spot of economists. School inspections have been studied in more detail by education researchers. But economists have genuinely raised the bar in terms of using rigorous quantitative methods to study education. In total, I count 7 causal studies of the effects of inspections on learning outcomes - 3 by economists and 4 by education researchers.


Putting aside learning outcomes for a moment, one study from leading RISE researchers, Karthik Muralidharan and Jishnu Das (with Alaka Holla and Aakash Mohpal), in rural India finds that “increases in the frequency of inspections are strongly correlated with lower teacher absence”, which could be expected to lead to more learning as a result. However, no such correlation was found for other countries in a companion study (Bangladesh, Ecuador, Indonesia, Peru, and Uganda).

There is also fascinating qualitative work by fellow RISE researcher, Yamini Aiyar (Director of the ‘Accountability Initiative’ and collaborator of RISE researchers Rukmini Banerji, Karthik Muralidharan, and Lant Pritchett) and co-authors, that looks into how local level education administrators view their role in the Indian state of Bihar. The most frequently used term by local officials to describe their role was a “Post Officer” - someone who simply passes messages up and down the bureaucratic chain - “a powerless cog in a large machine with little authority to take decisions." A survey of their time use found that on average a school visit lasts around one hour, with 15 minutes of that time spent in a classroom, with the rest spent “checking attendance registers, examining the mid-day meal scheme and engaging in casual conversations with headmasters and teacher colleagues … the process of school visits was reduced to a mechanical exercise of ticking boxes and collecting relevant data. Academic 'mentoring' of teachers was not part of the agenda.”

At the Education Partnerships Group (EPG) and RISE we're hoping to help fill this policy and research gap: through nascent school evaluation reforms supported by EPG in Madhya Pradesh, India, which will be studied by the RISE India research team, and through an ongoing reform project with the government of the Western Cape in South Africa. Everything we know about education systems in developing countries suggests that they are in crisis, and that a key part of the solution is around accountability. Yet we know little about how school inspections - the main component of school accountability in most developed countries - might be made more effective in poor countries. It's time we changed that.

This post appeared first on the RISE website. 

03 February 2025

The results agenda is yet to take hold in the UK

DFID Annual Budget: £10 billion

Current (domestic) UK Government "Major projects expenditure" with no plans to evaluate impact or value for money: £49 billion (NAO 2013: Evaluation in Government)

15 February 2025

When unintended consequences go... well!

After the disappointment of the null results from the Duflo study on labour market policy in France, a new CSAE study by Imbert and Papp has some more encouraging results: programme side-effects which go in a positive direction.

They find that the "National Rural Employment Guarantee Scheme" (NREGA) in India has a positive impact on private sector wages by bidding up the price of labour. The indirect gains to poor labourers from the higher private sector wages are big - about half of the value of the direct gains to participants from the public works programme. Of course, this increase in wages represents a loss for buyers of labour, but these tend to be people in the top 20%.

A couple of interesting implications that the authors note - first, this is evidence against the Lewis model of surplus labour which can be cheaply tapped for capitalist expansion.

Second - differences in the political power and organisation of landlord farmers may help explain differences in the implementation of the scheme across states.

Finally, a reminder that this is based on nationally representative data in a country of 1.2 billion people, and a programme which spends $9 billion a year and reaches millions of households. Take that, randomistas. 

21 January 2025

When rigorous impact evaluation *does* make quite a big difference

If you care at all about unemployment and labour market policy, or really about much of social policy, this new paper from Esther Duflo and co-authors should have you quite worried.

The policy - pay a private provider for each unemployed person that they get into a job.

The result (part 1) - the policy was successful at getting unemployed participants into jobs.

The result (part 2) - almost all of these jobs were just taken from other people who would otherwise have got them. Pure displacement. No net change in unemployment.

Most impact evaluations don't measure such "spillover" effects or "externalities", because they are really hard to measure (neither do most non-randomised evaluations; this is not a criticism of RCTs).
Ignoring externalities, we would have thus concluded, for example, that 100,000 euros invested in the program would lead 9.7 extra people to find a job within eight months. Since the effect disappears by 12 months, this already appears to be quite expensive, at about 10,000 euros for a job found on average four months earlier. But at least, it is not counterproductive. With externalities, investing 100,000 euros leads to no improvement at all.
Bruno Crepon, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora (2012), Do labor market policies have displacement effects? Evidence from a clustered randomized experiment
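The back-of-envelope arithmetic in the quoted passage is worth spelling out (figures taken directly from the quote above):

```python
# Cost-effectiveness ignoring displacement: 100,000 euros buys 9.7 extra
# people in a job within eight months.
spend = 100_000                # euros invested in the programme
extra_jobs_at_8_months = 9.7   # extra participants in work, ignoring externalities

cost_per_job = spend / extra_jobs_at_8_months
print(f"~{cost_per_job:,.0f} euros per job found earlier")  # roughly 10,000

# With displacement, every job gained by a participant is a job lost by a
# non-participant, so the net employment effect of the spend is zero.
net_extra_jobs = 0
```

Which is exactly why ignoring spillovers is so dangerous: the per-participant numbers look merely expensive, while the true net effect is nothing at all.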

28 October 2024

Cash transfers in Northern Kenya

The BBC have a short clip here of the new DFID Minister Justine Greening visiting the Hunger Safety Net Programme in Northern Kenya, where eligible households are said to get $40 every couple of months via a "Smartcard."


OPM is managing the evaluation of the project: you can see the Year 1 impact report here.

20 July 2025

Doing governance is hard #163826353

First the good news: a new evaluation report from a community driven reconstruction programme in Eastern Congo (HT: Sarah Bailey) shows yet again that it is possible to evaluate messy, hard-to-measure governance interventions using rigorous quantitative methods. IPA and JPAL have an evaluation of a similar programme in Sierra Leone.

Now the bad news: this kind of design only works with interventions at the local level because you need a large sample size of units - in this case villages. National-level interventions give you a sample size of one, not very conducive for quantitative analysis.

And the worse news: these local level governance interventions don't seem to work. Both this Congo study and the Sierra Leone study find no improvement in local governance.

Now for some better news: we actually already know what a lot of the national-level governance interventions that need to be done are. They are boring. Things like audits of government accounts. South Sudan has finally just published the audit of the 2007 accounts, to apparent astonishment and outrage by parliamentarians. It's pretty grim reading. Though I'm not sure how anyone is actually honestly surprised. Still, it's probably not totally outlandish to think that audits done a bit quicker than 5 years after the fact might improve budget governance.

And now for the worst news of all: much of this easy, boring, national-level governance stuff is around accountability - which means the national leadership intentionally putting in place limits on its own power. Binding its own hands. You have to be an incredibly enlightened leader to purposely reduce your own power. The whole point of the politics game is increasing your own power. Which means that you need people to demand accountability and force leaders into action. And despite all the talk about governance from the international community, we aren't really interested or able to be the ones doing the demanding.

21 May 2025

The Lancet's editors don't get evaluation (sadface)

So Matt beat me to the punch on Friday on the Lancet Millennium Village retraction. Since then I've been trying to think of a polite way of expressing my total dismay and despair at the tripe written by the Lancet editors in response to the retraction (for which, by the way, a little bit of kudos to Pronyk et al).

The Lancet editors write:
The Millennium Villages project team has quickly and commendably corrected the record after understanding the validity of the challenge it received. But the withdrawal of this element of the paper does not detract from the larger result—namely, that after 3 years Millennium Villages saw falls in poverty, food insecurity, stunting, and malaria parasitaemia, together with increases in access to safe water and sanitation.
Which is just total nonsense. For all we know, poverty fell in the Villages at the exact same rate as everywhere else. That is not an important result to be celebrated. I challenged Lancet editor Richard Horton on twitter as to why he would continue to emphasise this non-result, and he responded with yet more nonsense:

  1. richardhorton1 @rovingbandit To be fair, there were falls in each of the 5 MDG-1 poverty/nutrition measures, but these were not statistically significant.

That isn't even true. The first of the measures - wealth - is the opposite of poverty. It is *wealth* that fell (statistically insignificantly) in the Villages relative to comparisons. I despair. And kind of question my own sanity. Despite what Tim Worstall says, I'm really not a scientist, but it's pretty galling that people say economics is not a science like the physical sciences when this is the kind of guff published by the world's top medical journal.

Bill Easterly has a whole long list here of more terrible social science published in medical journals. At the bottom of the post, Ben Goldacre comments
i think journals publishing things outside of their field of expertise is risky, but i wld caution against developing a world view that economics journals are in a better shape overall than medical ones. as someone who flits into both, there are lots of things that are routine in medical journals, to a greater or lesser extent, but notably almost unheard of in economics. stuff like declarations of conflict of interest, structured write-ups, registering a protocol in advance of doing a study, etc. all of which wld be great to see more of outside medicine.
All of which is true. In particular I am struck by how easily readable a short, structured, 4 page Lancet write-up is. There are definitely lessons to be learnt across disciplines both ways. It's just an incredibly sad state of affairs that one of the lessons that journals of medicine, the discipline that gave us randomized controlled trials, needs to learn from economics, is a more careful attention to statistics and causality. 

14 May 2025

Millennium Villages: impact evaluation is almost beside the point

A lot has been said about evaluation and the impact of the Millennium Villages, most of which boils down to:

"What is the impact of the Millennium Village package of interventions on the area in question?"

The really depressing part though is that this is actually the least interesting question. Chances are that throwing in a whole bunch of extra inputs to a community will create some outputs, and some impact. The whole point of the Millennium Villages though is to provide a model for the rest of rural Africa to follow. The really interesting question is whether African governments have the desire and capability to manage a massive and complex scaling up of integrated service delivery across rural Africa.

A point which basically belongs to Bill Easterly.
Mr. Easterly argues that the Millennium approach would not work on a bigger scale because if expanded, “it immediately runs into the problems we’ve all been talking about: corruption, bad leadership, ethnic politics.” 
He said, “Sachs is essentially trying to create an island of success in a sea of failure, and maybe he’s done that, but it doesn’t address the sea of failure.” 
Mr. Easterly and others have criticized Mr. Sachs as not paying enough attention to bigger-picture issues like governance and corruption, which have stymied some of the best-intentioned and best-financed aid projects.
A proper randomised evaluation could give you a good estimate of the cost-effectiveness of the island. A difference-in-difference estimate could give you a slightly worse estimate. Doing a fake difference-in-difference with unreliable recall baselines, arbitrarily selected control villages, misrepresented results, and mathematical errors, will give you a pretty awful estimate. But either way, you are missing the main point, which is about scale and replication, and how that works.
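For readers unfamiliar with the jargon, a difference-in-differences estimate is a simple calculation: compare the change over time in project villages with the change over the same period in comparison villages. All the poverty rates below are hypothetical, purely for illustration.

```python
# Hypothetical poverty rates, before and after the intervention.
treated_before, treated_after = 0.60, 0.45   # project villages
control_before, control_after = 0.58, 0.52   # comparison villages

treated_change = treated_after - treated_before   # what happened in the villages
control_change = control_after - control_before   # what would have happened anyway
did_estimate = treated_change - control_change    # the programme's estimated effect

print(f"Estimated effect on the poverty rate: {did_estimate:+.2f}")
```

The whole method hinges on the comparison villages being a credible stand-in for what would have happened without the project - which is precisely what unreliable recall baselines and arbitrarily selected controls destroy.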

How feasible would it really be to replicate something like this on a national level in Ghana? How exactly would it work? Do the systems of accountability and capability exist at local levels to manage all of these projects? How would coordination and planning work between national ministries and their sectoral plans, and local level priorities?

The Millennium Village project seems to grasp vaguely at these issues but ultimately brush them under the table. From a MV project report:
Another challenge in some sites is insufficient capacity of local government to take full ownership of MV activities. This is manifested in unfulfilled pledges to perform mandated roles, unsatisfactory maintenance of infrastructure, and insufficient involvement of local elected officials. MV site teams are addressing these challenges by agreeing to jointly implement interventions targeted at improving the performance of sub‐district governments, increasing sensitization and engagement of local government officials, increasing joint monitoring of MV activities in communities, and developing training plans in technical, managerial, and planning skills for local government officials.
 Or : "we have no clue how to fix the systemic implementation challenges"

An anonymous aidworker writes on his blog Bottom-up thinking
I’ve noticed around here, normally sloth-like civil servants who won’t even sit in a meeting without a generous per diem rush around like lauded socialist workers striving manly (or womanly) in the name of their country when a bigwig is due to visit, working into the night and through weekends, all without any per diems...   
I fear all the achievements of the MVP will wash up against the great brick wall that is a change resistant bureaucracy.
None of this is to say that the situation is hopeless. It isn't. In particular there are elements of the Millennium Village package which are proven to be effective, cheap, and don't require complicated systems of governance and accountability. Namely distributing insecticide-treated bednets. Aid money can provide them easily, sustainability is less of a concern than other interventions, and you can buy them right now. Check out Givewell for a rigorous independent assessment (and recommendation) of the Against Malaria Foundation. Probably the single best way you could spend some money today. 

08 May 2025

OMG Millennium Villages Increase Poverty ROFL!!

The Guardian newspaper (a.k.a. the Millennium Village PR Department) reports "Child mortality down by a third in Jeffrey Sachs's Millennium Villages." Which is possibly true (I'm not going to even go into the validity of the non-random controls). But if you take a casual glance at the paper's results table, you'll also find no statistically significant impact of the project on poverty, nutrition, education, or child health.


Of all 18 indicators, 10 are totally statistically insignificant (no difference between intervention and comparison) and only 1 of the 18 indicators is significant at the 1% level.
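And remember a standard multiple-comparisons point (mine, not the paper's): test enough outcomes and some will look "significant" by chance alone, even if the true effect of everything is zero.

```python
# Expected number of false positives when testing many outcomes under the
# null hypothesis of no effect at all.
n_tests = 18

expected_false_positives_5pct = n_tests * 0.05   # ~0.9 per study at the 5% level
expected_false_positives_1pct = n_tests * 0.01   # ~0.18 at the 1% level
print(expected_false_positives_5pct, expected_false_positives_1pct)
```

So with 18 indicators you would expect roughly one "significant" result at the 5% level from pure noise - which makes a scorecard of one significant result at 1% distinctly unimpressive.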

The text of the Lancet paper mentions 3 times that poverty has fallen in the village sites. And just once that this reduction is actually no different to that in comparison villages.

And check out this sentence;
For 14 of 18 outcomes, changes occurred in the predicted direction. No significant differences were recorded when comparing poverty ...
So, mention the direction of the effect when it is the direction you want (but statistically insignificant from zero), and neglect to mention the direction of the effect when it is the direct opposite of what you want (but also insignificant).

Now THAT, folks, is science. (Here's the Lancet link, HT: Maham). 

13 April 2025

Scaling up Proven Interventions

If you are in the business of piloting development policies with NGOs, this chart should be keeping you awake at night. If you like to think about "sustainability" and "scale", about handing over your activities to the government, you need to be really really worried. 

Researchers persuaded World Vision and local government in Kenya to both implement the exact same intervention at the same time. The program as implemented by World Vision had a large impact on test scores. The exact same program, as implemented by the local government, had zero impact.


For more see Gabriel Demombynes.

His conclusion:
Evaluation skeptics may try to cite this as evidence that RCTs are a waste of time, since it suggests that successful interventions implemented by NGOs, as they often are in experiments, may not be replicated at scale by governments. Others might take the paper to indicate that NGOs should be the preferred vehicle for interventions. I think these readings would be mistaken, and I take two reflections from the paper. First, we should do many more rigorous studies working with governments where we vary forms of service delivery to better understand what can work in practice. Second, the World Bank’s approach to public services—the long, difficult slog of working to improve government systems—is the right one, because it’s the only way to ultimately make services work for the poor at large scale.
 I agree. Clearly working through government systems is essential. Innovations for Poverty Action are doing just this - scaling up the exact same contract teacher program tested with NGOs in Kenya and India, but doing it with the government in Ghana, and doing an RCT as they go.

Addendum: Here is the link to the full paper: http://www.cgdev.org/doc/kenya_rct_webdraft.pdf

07 March 2025

The RCT Bubble

Is it really a bubble? Whilst there has been rapid growth of impact evaluations of aid projects across agencies, and plenty of internet chatter, the vast majority of aid spending still does not get properly evaluated.  
For example, while there has been a substantial growth in impact evaluations of the World Bank development projects, only 8.8% of World Bank investment loans in 2009/10 had an impact evaluation. In 1999/00 the proportion was 2.4%.  ----- Martin Ravallion, World Bank Research Director

09 February 2025

Evaluating TOMS shoes, child sponsorship, cow donations, and fair trade coffee

And the prize for most diverse recent set of popular NGO development program intervention evaluations goes to (drum roll please)... Bruce Wydick!

James Choi quotes him writing in Christianity Today on Fair Trade Coffee:
Fair-trade coffee isn't a scam, but it is hard to find a development program that has attracted so much attention while having so little real impact. The most recent rigorous academic study, carried out by a group of researchers at the University of California, finds zero average impact on coffee grower incomes over 13 years of participation in a fair-trade coffee network.
 So I looked up his page and found this on child sponsorship:

Although international child sponsorship may be the most widespread form of personal contact between households in wealthy countries with the poor in developing countries, to date there are no published studies that have analyzed whether the beneficiaries of these programs have experienced changes in their life outcomes.  
We find large and statistically significant impacts of the [compassion international] child sponsorship program on most of our outcome variables.
And then there are rigorous impact evaluations to come on Heifer International and TOMS shoes - exciting stuff!

14 April 2025

More Than Good Intentions

More Than Good Intentions: How a New Economics is Helping to Solve Global Poverty is the new book by Dean Karlan and Jacob Appel, released today. 

I'm about 95% certain that I would be able to tell you I love the book even if I wasn't being paid to promote it. It's like Freakonomics only about global development.

If everyone would just read this book then I would probably be out of a job because you would all be totally convinced of the need for smart evidence-based aid and know all about the fantastic research that IPA is involved with. And I still want you to read it.

So go on, make me unemployed, I dare you.

You can read Chapter 1 here.

08 February 2025

Evaluation in Sudan

The OECD has just released

Aiding the Peace: A Multi-donor Evaluation of Support to Conflict Prevention and Peacebuilding Activities in Southern Sudan 2005-2010

Firstly that is a terrible terrible pun.

Secondly, not having read it yet, I'm guessing you would not call this a rigorous or quantitative evaluation as such. Still, it's probably worth a read for anyone interested in aid and Sudan.

06 August 2025

The Impact of Evaluation

Note: This was first posted on the IPA blog.

Alanna Shaikh started a bit of a debate last week on the limitations of impact evaluations. She cites Andrew Natsios (a former USAID administrator):

USAID has begun to favor health programs over democracy strengthening or governance programs because health programs can be more easily measured for impact. Rule of law efforts, on the other hand, are vital to development but hard to measure and therefore get less funding.

Lots of things are vital for development, but something being vital doesn’t mean that aid funding is necessarily an effective way of supplying it. Not only that, but something being difficult to measure does not make it impossible. And sure enough, JPAL and IPA have conducted a number of evaluations of governance projects, such as working with the police in Rajasthan, on peace education and ex-combatant reintegration projects in Liberia, and evaluating anti-corruption strategies in Indonesia.

Randomised impact evaluations give the strongest evidence available on a project’s effectiveness. If USAID is beginning to favor projects with evidence of impact that is a good thing. The challenge for governance and rule of law advocates is to prove their impact.

Dennis Whittle of Globalgiving.org adds another limitation:

Formal evaluations, including the gold standard of randomized controlled trials, are not scalable.  We simply do not have the time and resources to do centralized, in-depth evaluations of everything.

This argument is like not bothering with lifeboats if they can’t fit everyone in. Evaluations are crucial if we are going to learn whether or not we are wasting our money. And who knows, we might not be able to evaluate every single project, but if we keep coming up with compelling theories of change and keep replicating our findings in different settings, we could certainly try to evaluate every single intervention.

01 August 2025

Learning to Randomise

So yesterday was the last day of this year’s training week for JPAL and IPA staff based in Africa. I’m going to be starting in September as the IPA Communications Coordinator (professional blogger?!) based in New Haven, so this was a great opportunity for me to meet a bunch of the field and head office staff, and learn a bit more about how JPAL and IPA actually conduct an evaluation.

Things I have learned:

  1. The phrase for “gamble” in Liberian English is to “put it in the hole.” Which apparently makes for some entertaining interviews with youths about their appetite for risk.
  2. Limuru, Kenya, is COLD!
  3. People can go to great lengths to get a drink. (but this is OK).
  4. IPA staff are awesome. Everyone is fun, clever, and motivated, and are all total data-geeks. My kind of people.
  5. Oh yeah, and how to conduct a randomised evaluation.

Here is my ultra-condensed summary.

Day 1: Why randomise anyway? (Because if you care about measuring the impact of your project - this is by far the best way)

Day 2: How to use STATA. (just spend a few hundred hours learning to code)

Day 3: How to design an evaluation, and how to manage data collection in the field. (what are you randomising and why? how do you manage the logistics in the field)

Day 4: Data entry and project management.

Day 5: Ethical and privacy issues, dealing with research problems, and budgeting (I totally rocked the budgeting session)

Day 6: Bringing it all together: presenting a complete project design from start to finish.

If you are interested, all of the training materials are available on the MIT website, including videos, lecture notes, case studies and exercises.

The only thing not included is the trip to Lake Naivasha to walk amongst the zebra and giraffe. For that, you might just have to go and sign up…