
TL;DR

Between 2022 and 2023, GiveWell and Open Philanthropy committed over $70 million to Evidence Action’s Dispensers for Safe Water programme. An independent survey in 2025 found that in Uganda and Malawi the programme was reaching approximately 70% fewer people with detectable chlorine than its own monitoring had indicated (approximately 40% fewer in Kenya).

This piece argues that this case study reflects a structural gap in how sophisticated funders allocate analytical rigour: extraordinary attention is paid to whether an intervention works under controlled conditions, with far less paid to whether its implementation is actually delivering impact. I propose six practical steps funders can take to evaluate implementation fidelity with the same seriousness they apply to academic evidence, and I believe funders have a clear responsibility to do so.

Disclosure

I am the CEO of Fortify Health, an organisation that has received funding from GiveWell. I also sit on the Governance and Advisory Boards of organisations that have received funding from GiveWell. This article reflects my personal opinions and not those of Fortify Health or other associated organisations. 

I shared a draft of this article with GiveWell and Evidence Action prior to publication. Both organisations generously shared comments and clarifications; however, any errors are my own. Staff at GiveWell did not provide feedback on my analysis.

Introduction

The effective altruism community has rightly praised Evidence Action and GiveWell for the transparency with which they have handled the Dispensers for Safe Water (DSW) situation. But transparency in reporting a major oversight with practical impacts is different from preventing one. The harder question is this: how did a programme with extensive oversight, evidence-driven funding, and a grantee whose identity is built on measurement rigour produce a monitoring architecture that, over time, generated systematically biased data without triggering an earlier funder response?

Between 2022 and 2023 alone, GiveWell and Open Philanthropy committed more than $70 million to Evidence Action’s Dispensers for Safe Water programme [1][2]. An independent survey commissioned in 2025 found that far fewer people had detectable chlorine in their water than the programme’s own monitoring had indicated: approximately 70% fewer in Malawi and Uganda, and 40% fewer in Kenya (roughly 2 million people in total, against an indicated 5 million) [3]. As Evidence Action itself acknowledged, “our monitoring system lacked sufficient independence and created conditions where bias could enter” [3].

My thesis is this: the DSW case reveals a structural asymmetry in how sophisticated funders allocate rigour. Extraordinary analytical attention is applied to whether an intervention works. Comparatively little structured attention is applied to an equally important question: will this organisation actually deliver it? I believe funders have a clear and active responsibility to close this gap, not as an optional add-on to their existing frameworks, but as a core part of their role. This piece reviews what happened, diagnoses the structural problem, and proposes six practical steps to address it. 

Both Evidence Action and GiveWell have responded with genuine transparency and GiveWell is taking clear steps in the right direction. The question is whether those steps become systematic and transparent.

 

Part I: What Actually Happened

1.1 The Dispensers for Safe Water (DSW) Intervention and its Evidence

Understanding why DSW attracted such significant investment requires understanding how strong the underlying evidence looked. Between 2003 and 2010, economists including Michael Kremer, Edward Miguel, and Sendhil Mullainathan conducted RCTs on point-of-use chlorine dispensers in Kenya [4]. The results were compelling: the dispenser model, leveraging paid community promoters, increased chlorination uptake by 53 percentage points, with sustained effect sizes over 30 months [4].

In April 2022, GiveWell published a major update to its assessment of water quality interventions [5]. GiveWell’s meta-analysis estimated a 14% reduction in all-cause under-five mortality from water chlorination, with its best-guess effect for the Dispensers for Safe Water programme specifically at approximately 6% — a relatively large impact from such a cost-effective intervention [5][6]. This moved the programme’s estimated cost-effectiveness to ~seven times GiveWell’s benchmark [6], solidly within the range used to justify large grant recommendations at the time. The evidence base for the intervention was real.

The question that would come to define this case was different: whether the intervention was being adopted at the levels projected for cost-effective delivery (and in many geographies, it was not).

1.2 The Monitoring Architecture and Its Structural Flaws

The monitoring failure here was not fraud. It was a structurally biased architecture that should have been identified earlier. Evidence Action is one of the more sophisticated implementing organisations in the global health space, which makes the nature of the situation all the more striking. Coverage data was collected by an internally managed monitoring team — the Monitoring, Learning, and Evaluation (MLE) field officers — who were organisationally separate from programme staff but insufficiently independent from programme operations. 

Chlorine levels were measured using manual colour wheel kits, which, while generally appropriate for this context, became a secondary source of bias in combination with the structural independence issue. As GiveWell noted, data being collected by Evidence Action’s own staff may have led them to interpret colour wheel results more favourably [7].

To Evidence Action’s credit, their quality assurance system involved follow-ups with 5–10% of surveyed households, verified with GPS coordinates and timestamps, and was designed to flag statistical outliers [3]. In hindsight, this was insufficient to detect the biases embedded in the architecture itself. As Evidence Action acknowledged, the primary failure was structural: the monitoring system lacked sufficient independence, creating conditions where bias could enter at multiple points [3]. They further noted that randomisation skewed towards known dispenser users, missing households that had stopped chlorinating or never adopted [3]. Kevin Starr of the Mulago Foundation described this as a rookie mistake [8], a characterisation that carries particular weight given how central measurement is to Evidence Action’s identity. The conflicts of interest embedded in the monitoring design, while not fraudulent, were clear in hindsight and should at least have been flagged for questioning.

1.3 What the Independent Review Found and When

The timeline of when problems were identified matters as much as what was eventually found. In January 2022, GiveWell recommended a grant of up to $64.7 million to Evidence Action [1]. It was not until March 2024 that GiveWell received data from the Kenya Study of Water Treatment and Child Survival, collected between 2019 and 2021, indicating that chlorination rates may have been substantially lower than Evidence Action’s routine monitoring suggested [6]. As GiveWell acknowledged, the comparison was methodologically complex, which prompted Evidence Action, in consultation with GiveWell and the research team, to design and conduct a more substantive follow-up study in Kenya using both Evidence Action’s own monitoring methodology and an independent full census [6].

It is fair to question why that external data was not received until 2024, though I am not aware of the full reasons for the delay (and this is not material to the overall thesis). In March and April 2025, GiveWell funded a fully independent survey in Uganda and Malawi, conducted by the Development Innovation Lab and Innovations for Poverty Action [9]. The results confirmed that approximately 2 million people, as opposed to the estimated 5 million, were receiving chlorinated water across Uganda, Malawi, and Kenya [3]. Actual chlorine use was approximately one third of what monitoring had measured, and approximately 30% of dispensers lacked chlorine altogether [3].

1.4 Evidence Action and GiveWell’s Response

Both Evidence Action and GiveWell have responded with genuine transparency, acknowledging the structural monitoring failures and making difficult decisions to reduce and in some cases end key programmes, along with all the challenges those decisions entail. As GiveWell reflected, knowing what it knows now, it probably would not have made this grant again at its full size [6].

Evidence Action has committed to engaging external experts to review monitoring protocols across its full portfolio in 2026 [3]. As Elie Hassenfeld, CEO of GiveWell, put it: “That kind of collaboration, that willingness and desire to be transparent, even when the news might not be good, is not something that you should take for granted. That’s extremely rare in the nonprofit world” [7]. I believe that framing is genuine, and the transparency demonstrated here is rare and valuable. It does not, however, close off the more important question: how did this happen, and what structural changes are needed to ensure it is less likely to happen again? The opportunity cost of these funds is significant, particularly in light of the broader funding pressures the sector has faced over the past eighteen months.

Part II: The Structural Diagnosis

2.1 A Wider Pattern

The monitoring architecture that failed at DSW is not unusual. It is likely closer to standard practice across global health programming. Evidence Action could be seen as a worst-case scenario precisely because they are one of the most rigorous implementing organisations, backed by one of the most rigorous data-driven funders in the world. Internally managed monitoring staff, community-derived sampling frames, and quality assurance focused on procedural compliance rather than output validity are common features across the sector. If this architecture failed in one of the most rigorously evaluated, most scrutinised programmes in the EA ecosystem, the implications for programmes with lighter-touch oversight may be quite serious. Evidence Action’s own disclosure reinforces this: preliminary results suggested lower adoption rates than previously estimated, with similar monitoring challenges, in their in-line chlorination programme in Uganda and Malawi, which used the same underlying monitoring design [3].

2.2 The Analytic Gap: How Funders Allocate Rigour

I believe the central problem is a structural asymmetry in how analytical rigour is allocated. EA-aligned funders apply extraordinary rigour to the question of whether an intervention works under controlled conditions, while comparatively little structured attention is applied to an equally important question: is the on-the-ground implementation delivering the intended impact?

GiveWell’s cost-effectiveness analyses include adjustment factors that touch on implementation: internal validity for RCT quality, external validity for generalisability, grantee-level quality and track record assessed on a qualitative basis, and downside adjustments including for quality of monitoring and evaluation [6]. It is not that GiveWell ignores implementation. The problem is that these adjustments are structurally insufficient given the level of impact that implementation quality can have. 

GiveWell’s cost-effectiveness model includes two named adjustment factors that relate to monitoring quality: one for the risk of misappropriation without monitoring results, and one for the risk of false monitoring results [11]. These are meaningful inclusions. The problem is structural: both are applied as blended percentage discounts to the final cost-effectiveness output. Neither is decomposed into auditable sub-dimensions. Neither asks whether enumerators are independent of the staff being evaluated, whether the sampling frame is neutral, whether the measurement instrument is objective, whether temporal verifications are in place, or what the organisational processes are between a field data collector and the number in the model.
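To make the point concrete, here is a minimal sketch (in Python, with purely illustrative numbers of my own, not GiveWell’s actual adjustment values) contrasting a single blended discount with a decomposed set of auditable sub-adjustments. The point is not the specific values but that the decomposed version forces each implementation assumption into the open:

```python
# A minimal sketch contrasting a single blended monitoring discount with
# decomposed implementation adjustments. All values are illustrative
# assumptions, not GiveWell's actual figures.

# Blended approach: one opaque discount on the final figure.
base_cost_effectiveness = 7.0          # multiples of a funding benchmark
blended_monitoring_discount = 0.90     # a single "monitoring risk" factor
blended_estimate = base_cost_effectiveness * blended_monitoring_discount

# Decomposed approach: auditable sub-dimensions, each separately defensible.
implementation_factors = {
    "enumerator_independence": 0.85,   # are data collectors independent?
    "sampling_frame_neutrality": 0.80, # does sampling skew towards users?
    "measurement_objectivity": 0.90,   # subjective colour wheel readings
    "temporal_verification": 0.95,     # does QA reflect current delivery?
}
decomposed_discount = 1.0
for factor in implementation_factors.values():
    decomposed_discount *= factor
decomposed_estimate = base_cost_effectiveness * decomposed_discount

print(f"Blended:    {blended_estimate:.2f}x benchmark")    # 6.30x
print(f"Decomposed: {decomposed_estimate:.2f}x benchmark")  # ~4.07x
```

Even modest discounts on each sub-dimension compound to a materially lower estimate than a single blended factor, and, crucially, each one can be challenged and verified on its own terms.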

Coverage rates, usage rates, and cost per beneficiary denominators come almost entirely from implementing organisations’ own monitoring data [6]. GiveWell’s analysis of the RCT evidence base for water treatment is extensive and sophisticated [5]. The organisational and implementation quality analysis, the part that determines whether those effect sizes are actually realised in practice, receives minimal qualitative treatment and a single adjustment factor. This is a structural feature of a model designed to evaluate intervention theory rather than implementation reality. In the development sector, impact comes through implementation, not intent.

2.3 The Ecosystem Multiplier

The stakes are amplified by the reach of GiveWell’s analysis. GiveWell directed $397 million to its recommended charities in 2024 alone and has directed a cumulative $2.6 billion since its founding [12][13]. Coefficient Giving (formerly Open Philanthropy, rebranded November 2025) directed over $1 billion to effective causes in 2025 [14]. Giving What We Can has inspired over $375 million in donations [15], and The Life You Can Save has moved over $100 million to effective charities [16]. GiveWell’s recommendations do not only drive its own allocation decisions. Structural gaps in its framework are effectively replicated across a significant share of EA-aligned philanthropic capital. When GiveWell maintains Dispensers for Safe Water as a high-rated programme, that signal propagates to the broader ecosystem: not just in terms of direct funding, but in terms of the opportunity cost of that funding and the methodological norms that shape how cost-effectiveness is measured more broadly.

2.4 The Fiduciary Framing

EA-aligned funders hold themselves to exceptional standards of evidence. Applying those standards to the academic literature is one thing; applying them to the organisations through which they deploy capital is part of the same obligation. A fund manager who allocates capital based largely on a company’s self-reported financial data, applying a single blended discount for audit quality, would not be considered to have met their fiduciary duty. The parallel in philanthropy is structurally similar, particularly when researchers are acting on behalf of donors who trust them to ensure their money is having the impact claimed.

GiveWell has recognised aspects of this in its public statements and has taken meaningful steps in the right direction [7]. I believe there are opportunities to make these even more systematic and transparent as part of the cost-effectiveness analysis and grant evaluation process. 

Part III: Six Proposals for Improved Monitoring of Implementation

These proposals are offered in the spirit of a field that wants to do better. Evidence Action and GiveWell have already taken meaningful steps through commissioning independent surveys, publishing results honestly, and redirecting funding accordingly [7][6].

GiveWell has also clearly taken institutional steps over the past year, directing multiple grants towards coverage and monitoring surveys (for example, a grant to IDinsight to review AMF’s monitoring) and conducting red-teaming of monitoring and evaluation processes. These are very positive steps.

3.1 Monitoring Plans as Distinct, Costed, Evaluable Grant Components

Challenge: Currently, monitoring is either embedded in programme budgets, creating cost pressure towards cheaper and less independent methods, or treated as grantee-reported assurance rather than a funder-evaluated design question. Neither is sufficient.

Proposed Solution: Researchers could require standalone monitoring plans as a condition of grant approval, evaluated separately from the programme proposal against a clear set of criteria. Suggestions on those criteria are set out in section 3.4 below. Independent monitoring should also be budgeted as an explicit percentage of total grant value, particularly for grants above a certain size, removing the incentive structures that push implementing organisations towards cheaper, internally managed approaches rather than more robust internal systems or independent external verification.

3.2 A Structured Implementation Fidelity Layer in Cost-Effectiveness Models

Challenge: Current cost-effectiveness models collapse implementation quality into a single blended adjustment, making uncertainty about implementation fidelity largely invisible in the cost-effectiveness figure. This makes it difficult to understand how sensitive estimated impact is to implementation uncertainties; as this case shows, that sensitivity can be very high indeed.

Proposed Solution: Add explicit inputs and probability distributions for key implementation dimensions into CEAs, for example: monitoring independence, sampling frame integrity, measurement objectivity, data triangulation, temporal verification, and public data availability. Allowing uncertainties in these dimensions to propagate through cost-effectiveness calculations would produce a range that genuinely reflects implementation uncertainty, enabling the community to better understand the true confidence interval around potentially highly effective interventions.
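As a minimal sketch of what such a fidelity layer might look like, here is a Monte Carlo propagation over three of the dimensions above, assuming hypothetical Beta distributions (the parameters are my own illustrative assumptions, not calibrated estimates):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000  # Monte Carlo draws

# Hypothetical Beta distributions over implementation-fidelity dimensions.
# Parameters are illustrative assumptions chosen for the sketch.
monitoring_independence = rng.beta(8, 3, N)   # wide: structural doubt
sampling_frame_integrity = rng.beta(6, 4, N)  # known skew towards users
measurement_objectivity = rng.beta(9, 2, N)   # subjective instrument risk

base_estimate = 7.0  # point estimate, multiples of a funding benchmark

# Propagate: each dimension scales realised effectiveness multiplicatively.
realised = base_estimate * (monitoring_independence
                            * sampling_frame_integrity
                            * measurement_objectivity)

lo, mid, hi = np.percentile(realised, [5, 50, 95])
print(f"Median: {mid:.1f}x; 90% interval: [{lo:.1f}x, {hi:.1f}x]")
```

The output is a credible range rather than a point estimate, so decision-makers can see directly how much of the headline figure rests on implementation assumptions.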

3.3 Value of Information as a Framework for Monitoring Investment

Challenge: Independent monitoring is often treated as a cost rather than a value-creating investment. GiveWell already uses value of information (VoI) thinking when evaluating research opportunities and grant decisions [6][7]. The same methodology could be applied to questions of implementation quality: what is the expected value of investing in external implementation validation, given the current level of uncertainty?

Proposed Solution: In the case of Dispensers for Safe Water, the cost of the DIL and IPA independent survey was a fraction of the annual grant value [9]. Given the scale of the programme, the monitoring uncertainty, and the cost-effectiveness stakes, the expected value of that information was enormous, particularly when it had the potential to reveal changes of 50 to 70% in overall cost-effectiveness. VoI framing converts monitoring from a compliance cost to an investment decision subject to the same rigour as other allocation choices. I believe that this is now occurring more frequently. 
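A back-of-envelope version of that calculation, with hypothetical figures standing in for the survey cost and the funder’s prior, might look like this:

```python
# A back-of-envelope value-of-information sketch. The grant value is from
# the January 2022 grant [1]; the survey cost, prior, and overstatement
# fraction are hypothetical assumptions for illustration.

grant_value = 64_700_000   # USD, grant recommended up to this amount [1]
survey_cost = 500_000      # hypothetical independent survey cost

p_overstated = 0.3         # prior: monitoring is substantially biased
overstatement = 0.6        # if biased, fraction of claimed impact not realised

# Without the survey, funds keep flowing at full scale either way.
# With it, a biased programme can be scaled down and funds redirected.
expected_loss_without = p_overstated * overstatement * grant_value
expected_loss_with = survey_cost

voi = expected_loss_without - expected_loss_with
print(f"Expected value of the survey: ${voi:,.0f}")  # ~$11.1M on these inputs
```

On these assumptions the survey pays for itself more than twentyfold, which matches the qualitative point: where a programme is large and monitoring uncertainty is high, independent verification is an investment, not a cost.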

3.4 A Monitoring Plan Evaluation Checklist

Challenge: Without structured criteria for evaluating monitoring designs, funders must rely on qualitative judgement, which may not surface structural flaws of the kind that occurred with DSW.

Proposed Solution: The criteria below are a starting point, offered with the acknowledgement that those with greater field expertise will improve on them. It is worth noting that many of these considerations were discussed during the original DSW grant-making process and cost-rigour trade-offs were assessed by both internal and external parties at the time. A more formalised checklist may nonetheless help ensure these dimensions are applied systematically and consistently across the portfolio. 

  • Independence of data collection: Are enumerators independent of the staff whose performance is being evaluated? Who determines the sampling frame?
  • Data triangulation: Are multiple independent data sources being used? Is there a mechanism to cross-check self-reported or programme-collected data? To what extent is data publicly available?
  • Temporal relationship: Is monitoring measuring current field conditions or lagging reporting? How quickly does monitoring data reflect changes in delivery reality?
  • Point of delivery: Is measurement occurring at the point of actual service delivery or at an abstracted proxy?
  • Public vs. private data: Is monitoring data publicly available for external review, or held internally?
  • QA design: Is quality assurance designed to verify process compliance (meaning that monitoring is being conducted correctly) or output validity (meaning that the data actually reflects reality)?

A more robust checklist, developed with practitioner input, could bring meaningful rigour and transparency to how monitoring plans are assessed at the grant stage.
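As an illustration of how such a checklist could be applied systematically rather than ad hoc, here is a minimal Python sketch of the dimensions above as a scoring rubric. The scoring scale and the example scores (loosely modelled on the DSW flaws described in Part I) are my own assumptions:

```python
from dataclasses import dataclass

@dataclass
class MonitoringDimension:
    name: str
    question: str
    score: int  # 0 = clear structural flaw, 1 = partial, 2 = robust design

# Example assessment of a hypothetical monitoring plan.
plan = [
    MonitoringDimension("independence", "Enumerators independent of evaluated staff?", 0),
    MonitoringDimension("triangulation", "Multiple independent data sources?", 1),
    MonitoringDimension("temporal", "Monitoring reflects current field conditions?", 1),
    MonitoringDimension("delivery_point", "Measured at point of actual delivery?", 2),
    MonitoringDimension("public_data", "Data publicly available for review?", 0),
    MonitoringDimension("qa_design", "QA verifies output validity, not just process?", 0),
]

total = sum(d.score for d in plan)
print(f"Score: {total}/{2 * len(plan)}")
for d in plan:
    if d.score == 0:
        print(f"FLAG: {d.name} - {d.question}")
```

Any zero-scored dimension would trigger structured follow-up before grant approval, rather than being absorbed into a qualitative judgement.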

3.5 Field Expertise on Funder Teams

Challenge: The analytical gap between RCT evidence evaluation and implementation quality evaluation is partly a staffing and knowledge gap. The skills needed to evaluate whether a monitoring architecture can detect its own failure are different from those needed to interpret meta-analyses of RCT evidence and build cost-effectiveness models. 

Proposed Solution: I believe funders within the effective altruism space who invest heavily in academic research should also invest in people with contextual, operational programme management experience, people who can ask whether a proposed monitoring design can detect its own failure modes. This is a complement to quantitative rigour, not a replacement.

GiveWell’s move to requiring external surveys for all safe water grants [7] is a version of this, outsourcing the operational quality check to an independent evaluator. The broader principle is to institutionalise this capacity rather than apply it reactively. GiveWell already leverages academic experts, red-teaming exercises, and lookback reviews. Supplementing those with practitioners who have programmatic expertise can help identify implementation fidelity risks earlier, challenge potential grantees on those risks, and rigorously evaluate proposed monitoring plans.

A supplement to this would be more field visits to better understand and visualise implementation realities and potential challenges.

3.6 Evaluation of Grantee Governance Mechanisms

Challenge: Researchers and funders should be able to rely, to some extent, on internal organisational governance mechanisms to interrogate key assumptions and risks. Currently, there is no explicit attention paid to evaluating these mechanisms — particularly Boards and governance structures that hold legal responsibility for scrutinising implementation organisations' leadership and decisions.

Proposed Solution: There may be scope for funders in the EA space to bring greater rigour to the evaluation of organisational governance. This could include examining Board composition, reviewing Board meeting minutes as a proxy for Board engagement, audited financials, and other governance systems that serve to mitigate organisational risk and ensure leadership accountability.

Part IV: Responding to Potential Objections

4.1 “This would cost too much”

Monitoring, in general, represents a small percentage of programme budgets. The level of specialised expertise or academic infrastructure required for robust monitoring is relatively minimal compared with full evaluations (especially randomised controlled trials). With the advent of artificial intelligence, optical character recognition, and mechanisms for collecting data remotely, the cost of independent monitoring is decreasing rapidly and will continue to do so. There is also a question of proportionality. EA-aligned organisations already invest significantly in analytical capacity [17], and the DIL and IPA independent survey that identified the DSW failure was highly cost-effective within this financial paradigm. Ensuring an equivalent level of rigour, if not an equivalent level of financial investment, on something that can have an equally large effect on overall cost-effectiveness is, I would argue, entirely reasonable.

4.2 “We should trust our grantees”

A general orientation of trusting grantees, finding good people with strong interventions and a track record and trusting them to deliver, is broadly appropriate and I share it. But trust is not incompatible with verification, and it is not built in a vacuum. Trust is built through transparency and high-quality information sharing over time. The goal is not to impose burdensome reporting or regulatory frameworks on implementing organisations. It is to thoughtfully right-size monitoring systems so that they provide the information needed to maintain genuine trust.

It is also worth reframing who monitoring primarily serves. High-quality monitoring is not a concession to funders; it is a fundamental decision-making tool for organisations themselves. It helps them better understand targeting, identify opportunities for resource reallocation, and provides early warning signals when an intervention is not reaching people at the levels intended. Monitoring built well serves the organisation’s mission first. The funder’s need for verification is a secondary benefit.

4.3 “Monitoring should be proportionate to the grant”

I agree entirely. The level of monitoring investment, in time, systems, and capital, should be right-sized to the intervention, the level of uncertainty, the value of information, and the nature of what is being measured. Are there key metrics that are highly sensitive to overall cost-effectiveness? If so, greater investment is warranted. Where interventions already have high-quality, objective, third-party-collected, and publicly available data systems, additional monitoring may be minimal or even unnecessary. Certain technological interventions, for instance, have monitoring effectively baked into the delivery mechanism itself. The principle is not uniform investment; it is proportionate investment calibrated to where the uncertainty and the stakes are highest.

4.4 “This means outsourcing monitoring, with all the risks that entails”

Not necessarily. Much of what I am proposing can be done internally, but it requires developing systems robustly, asking the right questions, and ensuring triangulation and validation, as with any internal auditing process. 

External audits serve their purpose at the right time, for the right questions, and at the right cost, but they are not the only mechanism. The DSW case illustrates that even where internal monitoring staff have a separate reporting structure from programme staff, proximity to programme operations can still introduce bias. The key principle is structural independence sufficient to detect the monitoring system’s own potential failures, whether achieved through internal redesign or external verification. QA focused on output validity rather than procedural compliance can achieve much of what is needed without a large external apparatus.

Summary

The Dispensers for Safe Water case is not an argument against evidence-based global health funding. It is an argument that evidence must cover both questions: does this intervention work under controlled conditions, and is it being delivered at the projected cost-effectiveness? Those are equally hard questions that require equal rigour.

The effective altruism movement’s credibility rests on the claim that it takes evidence seriously. The response over the past few months demonstrates that commitment. Extending it to implementation fidelity is not a weakening of that commitment; it is a completion of it. Funders have a clear responsibility here, not just as evaluators of academic evidence, but as active participants in ensuring their investments reach the people they are intended to reach. 

GiveWell and Evidence Action have taken a difficult and admirable step. I believe it is incumbent on the community to ask what structural shifts are needed to ensure that this case becomes a turning point rather than a one-off response. From my broader experience working in global development and global health, I suspect this may be the tip of the iceberg, though I want to be clear that this is anecdotal and difficult to empirically verify. That is not an argument for pessimism; it is an argument for building better systems. I hope this piece serves as a contribution to that conversation.

As Elie Hassenfeld said, “We need to figure out how to do this monitoring well.” [7]

References

[1] GiveWell: Evidence Action DSW General Support grant (January 2022)

[2] Open Philanthropy: Evidence Action DSW grant (February 2022)

[3] Evidence Action: Following the evidence, even when it’s hard (March 5, 2026; corrected March 27)

[4] IPA: Chlorine Dispensers for Safe Water — programme history

[5] GiveWell: A major update in our assessment of water quality interventions (April 2022)

[6] GiveWell Lookback: Dispensers for Safe Water (September 2025)

[7] GiveWell Podcast Episode 25: Following the Data on Dispensers for Safe Water (March 5, 2026)

[8] Kevin Starr, Mulago Foundation: LinkedIn post (March 2026)

[9] GiveWell: DIL/IPA Chlorine Coverage Surveys in Uganda and Malawi (March–April 2025)

[10] GiveWell: Safe Water Projects — Saving Lives and Improving Our Grantmaking (October 2025)

[11] GiveWell: Our criteria — cost-effectiveness methodology

[12] GiveWell: About — Our Impact

[13] GiveWell: 2024 Metrics and Impact (August 2025)

[14] Coefficient Giving (formerly Open Philanthropy): About Us

[15] Giving What We Can

[16] The Life You Can Save: Annual Reports

[17] GiveWell: Senior Researcher job posting — salary bands
