Today I came across Project APE, short for Automated Policy Evaluation. The project uses Claude Code plus public observational datasets to generate end-to-end, economics-style policy impact papers: from idea selection and data pulls to estimation, writeup, review, and reproducibility checks.

I am both intrigued and baffled by the project.

Baffled because Project APE is perhaps aping (pun intended) current methods too much. The process of evaluating policies one journal paper at a time is incredibly inefficient. Why would you ever want to replicate it with AI? Ideally, you’d want to leapfrog it.

Intrigued because it is a new take on causal discovery, which uses computational methods at scale “to discover causal relations by analyzing statistical properties of purely observational data” (Glymour, Zhang, and Spirtes 2019).

Causal discovery

In traditional causal discovery the objective is to back out the causal directed acyclic graph that generated the observational dataset at hand. That is, you want to infer the causal relations amongst all variables in a given dataset. By contrast, policy evaluation typically targets a specific causal parameter: one describing the causal effect of a policy intervention \(X\) on some outcome of interest \(Y\). As such it focuses on a much narrower area of the graph.

Causal discovery typically relies on conditional independence testing and equivalence classes for causal identification. By contrast, policy evaluation relies on research design features like shocks and discontinuities that, at least locally, make the policy appear to have been “as if” randomized.
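To make the conditional independence idea concrete, here is a minimal sketch (not Project APE's actual method) of the kind of test a constraint-based discovery algorithm runs: on data simulated from a chain \(X \to Z \to Y\), \(X\) and \(Y\) are strongly correlated marginally, but roughly independent once we condition on \(Z\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000

# Simulate a causal chain X -> Z -> Y
x = rng.normal(size=n)
z = 2.0 * x + rng.normal(size=n)
y = 1.5 * z + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return stats.pearsonr(ra, rb)

r_xy, _ = stats.pearsonr(x, y)     # marginal: strongly correlated
r_xy_z, p = partial_corr(x, y, z)  # conditional on Z: approximately zero

print(f"corr(X,Y) = {r_xy:.2f}, corr(X,Y | Z) = {r_xy_z:.2f}")
```

A discovery algorithm would run many such tests to prune edges; note that the test alone cannot distinguish the chain from \(X \leftarrow Z \leftarrow Y\) or \(X \leftarrow Z \to Y\), which is exactly why discovery methods return equivalence classes of graphs.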

These differences notwithstanding, both causal discovery and automated policy evaluation have in common the use of “brute force” computational methods to identify causal relations in observational data.

Automated policy evaluation

In principle, you could feed an automated policy evaluation algorithm a query

$$ \mathcal{Q} = \{(\mathbf{x}_k, \mathbf{y}_k, \mathbf{Z}_k)\}_{k=1}^{K} $$

where (for simplicity) each element \(q_k := (\mathbf{x}_k,\mathbf{y}_k,\mathbf{Z}_k)\) corresponds to a single policy (\(\mathbf{x}_k\)) and a single outcome (\(\mathbf{y}_k\)) measured on the same sample of \(N_k\) individuals with covariates \(\mathbf{Z}_k\). (You could create this query manually, or train another model to generate queries.)
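A query element like \(q_k\) is easy to sketch as a data structure. The field names below are my own illustration, not part of any actual APE interface:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Query:
    """One element q_k: a policy, an outcome, and covariates
    measured on the same sample of N_k individuals."""
    policy: np.ndarray      # x_k, shape (N_k,)
    outcome: np.ndarray     # y_k, shape (N_k,)
    covariates: np.ndarray  # Z_k, shape (N_k, p)


# The full query Q is a list of K such elements; here K = 1 with toy data.
rng = np.random.default_rng(0)
Q = [
    Query(
        policy=rng.binomial(1, 0.5, size=100),
        outcome=rng.normal(size=100),
        covariates=rng.normal(size=(100, 3)),
    )
]
```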

Given a query \(\mathcal{Q}\), the algorithm would then use web search agents to learn about the context of each query \(q_k\) to come up with the best possible identification strategy.

The final output would then be an answer to each query of the form

$$ \mathcal{A} = \{\text{estimand}_k, \text{estimator}_k, \text{estimate}_k, \text{confidence}_k, \text{replication file}_k\}_{k=1}^{K} $$

where the estimand might be the average treatment effect or the local average treatment effect, say; the estimator might be OLS or instrumental variables (IV); the estimate is a point estimate and its standard error; the confidence is some measure of the credibility of the identification strategy (crudely, natural experiment > matching); and the replication file contains all the identification assumptions and details needed for a fully automated procedural replication (Martel García 2016).
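The answer side can be sketched the same way. Again, the field names and values here are hypothetical placeholders, just to pin down the shape of one element \(a_k\):

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """One element a_k of the answer set A (all fields illustrative)."""
    estimand: str            # e.g. "ATE" or "LATE"
    estimator: str           # e.g. "OLS" or "IV"
    estimate: float          # point estimate
    std_error: float         # its standard error
    confidence: float        # credibility of the identification strategy, in [0, 1]
    replication_file: str    # assumptions + details for automated replication


# A hypothetical answer to a single query q_k
a_k = Answer(
    estimand="LATE",
    estimator="IV",
    estimate=0.42,
    std_error=0.10,
    confidence=0.9,
    replication_file="q_k/replication.json",
)
```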

Next, you could sort \(\mathcal{A}\) by some function of policy relevance and confidence, then pick the top ranked answers for deeper dives involving humans and experiments. In this way APE becomes like a giant hypothesis pre-processor. A first pass, if you will, at identifying promising policies using observational data.
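The sorting step is simple to sketch. Assuming (my assumption, not the project's) that each answer carries a relevance score alongside its confidence, a scoring function could just take their product:

```python
def rank_answers(answers, top_n=3):
    """Score answers by relevance * confidence and keep the top_n
    for deeper dives involving humans and experiments."""
    return sorted(
        answers,
        key=lambda a: a["relevance"] * a["confidence"],
        reverse=True,
    )[:top_n]


# Toy answer set: high confidence but low relevance (p1), and vice versa
A = [
    {"policy": "p1", "confidence": 0.9, "relevance": 0.2},  # score 0.18
    {"policy": "p2", "confidence": 0.6, "relevance": 0.9},  # score 0.54
    {"policy": "p3", "confidence": 0.3, "relevance": 0.9},  # score 0.27
]
top = rank_answers(A, top_n=2)
print([a["policy"] for a in top])  # ['p2', 'p3']
```

The product is just one choice of ranking function; in practice you might weight confidence more heavily, since a highly relevant but poorly identified answer is exactly the kind you do not want to act on.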

What does automated policy evaluation buy us?

Automated policy evaluation might increase the stock and flow of impactful, high confidence answers. At any point in time there is a fixed stock of data (although the supply of available data might be quite elastic). Suppose there are \(R\) impactful answers yet to be discovered in this stock of data. With APE we might get to them quicker. This is the stock effect.

In addition, each year the world generates new policies, outcomes, and data, and with it additional impactful answers to be discovered. With APE we might be able to identify these novel answers just as soon as they are in the data. This is the flow effect.

Overall APE increases knowledge production.

That said, the impact on actual welfare might be much smaller, both in terms of the stock and flow adjustments. Knowing the right policy is a necessary but not sufficient condition for improving outcomes. Arguably, it is seldom the binding constraint.

First, even if an answer \(a_k\) is very high confidence, high impact, and compelling, it does not follow that it will be adopted. The bane of political economy is the staggering number of suboptimal policies that survive in the political equilibrium. Not because we don’t know any better, but because the constellation of winners, losers and public choice procedures is such that change is nearly impossible.

For example, while most of the world moves towards driverless subways, NYC maintains a driver and a conductor in each train, in part due to union contracts. Demonstrating that a one- or zero-person operation is just as safe and effective as a two-person one will likely do very little to change the political equilibrium.

Second, field experiments may be required in the more common scenario where answers are not clear cut. In the social sciences such experiments can take a long time. Unless, that is, we forgo the whole writing, peer review, printing and publication process (though we would still need dissemination efforts if we are to change behavior).

[Figure: RCT timeline showing research progression from idea through dissemination over seven years, with the peer review and printing phases highlighted between years 5 and 6. Source: Fletcher and Fletcher (2003).]

Ideally, we would bypass the old way of doing things. In practice, the old way of doing things is itself a political equilibrium. Will AI help change the equilibrium? My sense is that it will, but slowly, and in subtle ways.

References

Fletcher, Robert H, and Suzanne W Fletcher. 2003. “The Effectiveness of Journal Peer Review.” Peer Review in Health Sciences 2: 62–75.

Glymour, Clark, Kun Zhang, and Peter Spirtes. 2019. “Review of Causal Discovery Methods Based on Graphical Models.” Frontiers in Genetics 10 (June). https://doi.org/10.3389/fgene.2019.00524.

Martel García, Fernando. 2016. “Replication and the Manufacture of Scientific Inferences: A Formal Approach.” International Studies Perspectives, February. https://doi.org/10.1093/isp/ekv011.

