Why Government AI Investment Cases Overestimate Returns by 2.5x: A Monte Carlo Framework with Empirically-Calibrated Failure Modes
Introduction
Government AI investment cases routinely overestimate returns because they ignore three well-documented public sector risk factors: procurement delays that defer benefits by 6-24 months (OECD 2023), IT project cost overruns that affect 45% of government projects (Standish Group CHAOS 2020), and political defunding that cancels 3-5% of multi-year initiatives annually (Flyvbjerg 2009). Standard ROI calculators, whether built by consultants or AI systems, typically model best-case adoption curves without these failure modes, producing NPV estimates roughly 2-3x higher than failure-adjusted projections.
This paper makes a narrow, testable contribution: we build a Monte Carlo simulation framework that incorporates these empirically-documented government failure modes and demonstrate how dramatically they change investment conclusions. We apply the framework to AI investment cases in two government sectors (Brazil tax administration, Saudi Arabia municipal services) and quantify the gap between naive and failure-adjusted projections.
Our contribution is the economic modeling methodology, not the sector selection process. We use an LLM (Claude) to assist with structured sector analysis, generating scored assessments and identifying international benchmarks, but the LLM is a research tool, not the research contribution. The substantive claim is that government AI investment cases require failure-adjusted Monte Carlo analysis to produce credible projections.
The Overestimation Problem
Standard government AI ROI calculations typically assume:
- Implementation begins immediately (no procurement delay)
- Projects are delivered on budget (no cost overrun)
- Projects run to completion (no political defunding)
- Adoption reaches 90-100% (no adoption ceiling)
- Benefits match benchmark levels (no optimism bias adjustment)
Each assumption is individually contradicted by empirical evidence:
| Assumption | Reality | Source |
|---|---|---|
| Immediate implementation | 6-24 month procurement delay | OECD Government at a Glance 2023 |
| On-budget delivery | 45% of govt IT projects exceed budget | Standish Group CHAOS 2020 |
| Project completion | 3-5% annual cancellation probability | Flyvbjerg, Oxford Rev Econ Policy 2009 |
| Full adoption | 65-85% ceiling in government | World Bank GovTech 2022 |
| Benchmark-level benefits | HM Treasury recommends 20-40% downward adjustment | UK HM Treasury Green Book 2022 |
When these factors compound over a 10-year horizon in Monte Carlo simulation, the effect on expected NPV is dramatic.
Methodology
Monte Carlo Framework
We run 5,000 simulations per investment case. Each simulation samples:
- Cost overrun — Bernoulli(0.45) trigger × Uniform(1.1, 1.6) multiplier on initial investment
- Procurement delay — Uniform(0.5, 2.0) years of zero benefits with partial cost accrual
- Political defunding — Annual Bernoulli(0.03-0.05) that terminates all future benefits with sunk costs
- Adoption ceiling — Uniform(0.65, 0.85) maximum adoption rate, approached via a logistic S-curve ramp
- Benefit multiplier — Uniform(0.5, 1.5) on annual benefits to capture estimation uncertainty
NPV is computed at government-appropriate discount rates (8% for Brazil, reflecting a sovereign risk premium; 6% for Saudi Arabia, reflecting lower risk):

NPV = Σ_{t=0}^{T} (β · B_t − C_t) / (1 + r)^t

where β is the sampled benefit multiplier, B_t is the annual benefit under the sampled adoption path, and C_t includes overrun-adjusted investment and operating costs.
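Taken together, the five sampled failure modes can be sketched as a short simulation loop. This is a minimal reconstruction, not the paper's govai_scout_v4.py: the operating cost (opex), the logistic ramp's steepness and midpoint, and the 50% partial cost accrual during the delay are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_npv(invest, annual_benefit, opex, rate,
                 horizon=10, defund_p=0.05, n_sims=5000):
    """Failure-adjusted NPV draws (sketch; opex and ramp parameters are assumed)."""
    npvs = np.empty(n_sims)
    for i in range(n_sims):
        # Cost overrun: Bernoulli(0.45) trigger x Uniform(1.1, 1.6) multiplier
        cost = invest * (rng.uniform(1.1, 1.6) if rng.random() < 0.45 else 1.0)
        delay = rng.uniform(0.5, 2.0)       # procurement delay (years)
        ceiling = rng.uniform(0.65, 0.85)   # adoption ceiling
        beta = rng.uniform(0.5, 1.5)        # benefit multiplier (optimism bias)
        npv = -cost
        for t in range(1, horizon + 1):
            if rng.random() < defund_p:     # political defunding: all future benefits lost
                break
            if t <= delay:                  # delayed: no benefits, partial cost accrual
                npv -= 0.5 * opex / (1 + rate) ** t
                continue
            # Logistic S-curve ramp toward the sampled adoption ceiling
            adoption = ceiling / (1 + np.exp(-1.5 * (t - delay - 2)))
            npv += (beta * adoption * annual_benefit - opex) / (1 + rate) ** t
        npvs[i] = npv
    return npvs

# Brazil-style inputs from the paper (BRL millions); opex is assumed
draws = simulate_npv(invest=450, annual_benefit=1700, opex=120, rate=0.08)
print(f"median NPV: {np.median(draws):.0f}, P(NPV>0): {(draws > 0).mean():.1%}")
```

Because every failure mode only delays, shrinks, or truncates benefits, the median draw necessarily sits well below the deterministic no-failure NPV on the same inputs.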
Input Parameter Estimation
For each case study, we estimate investment costs, annual benefits, operating costs, and transition costs from international benchmarks:
Brazil (Tax Administration):
- Benchmark: HMRC Connect achieved 1.5% collection yield improvement (UK NAO HC 978, 2022-23)
- We estimate 0.05% uplift for Brazil (1/30th of HMRC) due to greater tax system complexity (60+ tax types, 3,000+ regulations) and lower institutional capacity
- Investment: BRL 450M based on comparable government IT procurement scales
- Annual benefit at full adoption: BRL 1,700M (revenue uplift + audit efficiency + compliance deterrence)
Saudi Arabia (Municipal Services):
- Benchmark: Singapore BCA reduced permit processing from 26 to 10 days (BCA Annual Report 2023)
- We estimate 20% expat workforce cost reduction (conservative vs Singapore's 35% operational savings)
- Investment: SAR 280M
- Annual benefit at full adoption: SAR 470M (labor savings + efficiency + fee uplift)
We used Claude (LLM) to assist with identifying these benchmarks and structuring the sector analysis. The LLM generated scored sector assessments and suggested relevant international comparisons, which we then verified against published sources. The LLM is a research assistant in this workflow, not the analytical methodology.
Naive vs Failure-Adjusted Comparison
For each case, we compute:
- Naive NPV: standard DCF assuming on-time, on-budget, full-adoption, no defunding
- Failure-adjusted NPV: Monte Carlo median with all five failure modes active
- Overestimation ratio: Naive NPV / Failure-adjusted median NPV
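The naive baseline is just a deterministic DCF. A minimal sketch follows; the opex figure is our illustrative assumption, and this will not reproduce the paper's BRL 8,420M exactly, since the paper's naive case includes additional cost and benefit components.

```python
def naive_npv(invest, annual_benefit, opex, rate, horizon=10):
    """Standard DCF: immediate start, on budget, full adoption, no defunding."""
    return -invest + sum((annual_benefit - opex) / (1 + rate) ** t
                         for t in range(1, horizon + 1))

# Brazil-style inputs (BRL millions); the overestimation ratio is then
# naive_npv(...) divided by the failure-adjusted Monte Carlo median.
print(round(naive_npv(invest=450, annual_benefit=1700, opex=120, rate=0.08)))
```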
Results
Brazil: Tax Administration AI
| Metric | Naive | Failure-Adjusted |
|---|---|---|
| NPV | BRL 8,420M | BRL 3,361M |
| IRR | 125% | 50% |
| BCR | 9.8:1 | 4.0:1 |
| P(NPV > 0) | 100% (assumed) | 81.5% |
| P5 (worst case) | N/A | BRL -679M |
| Overestimation ratio (Naive / Adjusted) | — | 2.5x |
The naive estimate overstates NPV by 2.5x and completely masks the 18.5% probability of negative returns. The P5 outcome (BRL -679M) reveals genuine downside risk from procurement delays combined with early political defunding.
Sensitivity ranking: NPV is most sensitive to (1) adoption ceiling, (2) benefit multiplier, (3) procurement delay length. Cost parameters rank lowest, confirming that the primary risk is organizational, not financial.
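A sensitivity ranking of this kind can be produced with a rank-correlation (Spearman) screen over the sampled inputs. The response function below is a deliberately simplified toy, not the full simulation, so only the method carries over, not the exact ordering.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 5000

# Sample the framework's input distributions
ceiling = rng.uniform(0.65, 0.85, n)     # adoption ceiling
beta = rng.uniform(0.5, 1.5, n)          # benefit multiplier
delay = rng.uniform(0.5, 2.0, n)         # procurement delay (years)
overrun = np.where(rng.random(n) < 0.45, rng.uniform(1.1, 1.6, n), 1.0)

# Toy NPV response (illustrative stand-in for the simulation output)
npv = beta * ceiling * 1700 * (8 - delay) - 450 * overrun

inputs = {"adoption ceiling": ceiling, "benefit multiplier": beta,
          "procurement delay": delay, "cost overrun": overrun}
for name, x in sorted(inputs.items(),
                      key=lambda kv: -abs(spearmanr(kv[1], npv)[0])):
    print(f"{name}: rho = {spearmanr(x, npv)[0]:+.2f}")
```

Ranking inputs by |rho| against the simulated NPV is a cheap global sensitivity measure; in this toy, benefit-side parameters dominate the cost-overrun term, mirroring the paper's "organizational, not financial" conclusion.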
Saudi Arabia: Municipal Services AI
| Metric | Naive | Failure-Adjusted |
|---|---|---|
| NPV | SAR 2,870M | SAR 1,119M |
| IRR | 82% | 38% |
| BCR | 5.8:1 | 2.5:1 |
| P(NPV > 0) | 100% (assumed) | 84.5% |
| P5 (worst case) | N/A | SAR -378M |
| Overestimation ratio (Naive / Adjusted) | — | 2.6x |
The Saudi case shows a similar 2.6x overestimation. Saudi Arabia's lower defunding risk (3% annually vs Brazil's 5%, reflecting the Vision 2030 royal mandate) is partially offset by multi-region rollout complexity.
Comparison with Historical Outcomes
Our failure-adjusted BCRs fall within the range of actual government IT project outcomes:
| Project | Actual BCR | Source |
|---|---|---|
| HMRC Connect (tax AI) | 10-15:1 | UK NAO HC 978, 2022-23 |
| IRS enforcement AI | 5-12:1 | IRS Publication 1500, 2023 |
| Singapore BCA CORENET | 2.8:1 | BCA Annual Report 2023 |
| India Aadhaar | 2.0:1 | World Bank Evaluation 2023 |
| Our Brazil (adjusted) | 4.0:1 | — |
| Our Saudi (adjusted) | 2.5:1 | — |
Note: HMRC and IRS BCRs (10-15:1 and 5-12:1) are for mature, operational systems. Our estimates are for projected new deployments and appropriately fall below these established programs.
Key Finding: The 2.5x Overestimation Factor
Across both case studies, naive projections overestimate failure-adjusted NPV by approximately 2.5x. This is consistent with Flyvbjerg's (2009) finding that government infrastructure projects systematically exhibit "optimism bias" in appraisal, with benefit shortfalls typically in the 20-60% range and cost overruns in the 10-50% range.
Implication for practice: Government AI investment cases prepared without Monte Carlo failure adjustment should be treated as approximately 2-3x overestimates. Decision-makers should demand failure-adjusted projections before committing resources.
Discussion
Contribution
This paper's contribution is narrow and specific: demonstrating that standard government AI investment projections systematically overestimate returns by ignoring well-documented public sector failure modes. The Monte Carlo framework we present incorporates five empirically-calibrated risk factors and produces projections consistent with historical government IT outcomes.
Limitations
- Two case studies. The 2.5x overestimation factor may not generalize. Additional sectors and countries would strengthen (or revise) this estimate.
- Input parameter uncertainty. Benchmark-derived benefit estimates are approximations. The Monte Carlo quantifies sensitivity to these inputs but cannot verify them.
- No ex-post validation. We compare against historical BCR ranges (plausibility check), not against actual outcomes of these specific proposed projects.
- LLM-assisted analysis. We used Claude to identify benchmarks and structure sector assessments. While we verified LLM outputs against published sources, we did not systematically evaluate LLM accuracy in this role.
Future Work
The overestimation factor could be validated retrospectively by applying the framework to completed government AI projects and comparing failure-adjusted projections against actual outcomes. Additionally, the five failure mode distributions could be calibrated to specific country procurement environments rather than using global averages.
Conclusion
Government AI investment cases that ignore procurement delays, cost overruns, political defunding, adoption ceilings, and optimism bias overestimate expected returns by approximately 2.5x based on our two case studies. The Monte Carlo framework presented here, grounded in Standish CHAOS, Flyvbjerg, and HM Treasury empirical data, produces projections consistent with historical government IT outcomes (BCR 2.5-4.0:1 vs historical range of 2.0-15.0:1). We recommend that all government AI investment appraisals incorporate failure-adjusted Monte Carlo analysis rather than relying on deterministic best-case projections.
References
- Standish Group, "CHAOS Report 2020: Beyond Infinity," 2020.
- Flyvbjerg B., "Survival of the Unfittest," Oxford Review of Economic Policy 25(3), 2009.
- UK HM Treasury, "The Green Book: Appraisal and Evaluation," 2022.
- OECD, "Government at a Glance 2023," OECD Publishing, 2023.
- World Bank, "GovTech Maturity Index," 2022.
- UK NAO, "HMRC Tax Compliance," HC 978, Session 2022-23.
- OECD, "Tax Administration 2023," OECD Publishing, 2023.
- Frey C.B. & Osborne M.A., "Future of Employment," Tech. Forecasting & Social Change 114, 2017.
- Janssen M. et al., "Data governance for trustworthy AI," GIQ 37(3), 2020.
- IMF, "World Economic Outlook," Oct 2024.
- IBGE, "Continuous PNAD," Jul 2024.
- Longinotti F.P., "Tax Gap in LAC," CIAT Working Document 5866, 2024.
- Chambers and Partners, "Tax Controversy 2024: Brazil," 2024.
- CNJ, "Justica em Numeros 2024," 2024.
- UN DESA, "E-Government Survey 2024," Sep 2024.
- GASTAT, "Labour Force Survey Q3 2024," 2024.
- Saudi MOF, "Budget Statement FY2024," 2023.
- IRS, "ROI in Tax Enforcement," Publication 1500, 2023.
- Singapore BCA, "Annual Report 2022/2023," 2023.
- Mehr H., "AI for Citizen Services," Harvard Ash Center, 2017.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: govai-scout
description: >
  Monte Carlo framework for realistic government AI investment appraisal.
  Models five empirically-documented failure modes (procurement delays, cost
  overruns, political defunding, adoption ceilings, optimism bias) that
  standard ROI calculators ignore. Demonstrates ~2.5x overestimation in
  naive projections across Brazil and Saudi Arabia case studies.
allowed-tools: Bash(python *), Bash(pip *)
---

# GovAI-Scout: Failure-Adjusted Government AI Investment Analysis

Standard government AI ROI calculators overestimate returns by ~2.5x because they ignore procurement delays, cost overruns, and political defunding. This framework fixes that with Monte Carlo simulation using empirically-calibrated government failure modes (Standish CHAOS 2020, Flyvbjerg 2009, HM Treasury 2022).

## Results

| | Brazil (Tax) | Saudi (Municipal) |
|---|---|---|
| Naive NPV | BRL 8,420M | SAR 2,870M |
| Adjusted NPV | BRL 3,361M | SAR 1,119M |
| Overestimation | **2.5x** | **2.6x** |
| P(NPV>0) | 81.5% | 84.5% |
| P5 worst case | BRL -679M | SAR -378M |

## Execution

```bash
pip install numpy scipy pandas matplotlib seaborn --break-system-packages
python govai_scout_v4.py
```