amazon-science/agentic-forking-path
Language: Python
License: Apache-2.0
Stars: 1
Forks: 1
Open issues: 7
Created: 2026-03-30T13:15:53Z
Pushed: 2026-06-03T23:08:44Z
Default branch: main
Fork: no
Archived: no
README:
Agentic Forking Path: Automated Multi-Analyst Pipeline
Code for The Agentic Garden of Forking Paths.
Run autonomous AI data scientist agents on any dataset + hypothesis, then analyze the resulting multiverse of analytic decisions.
Each agent independently analyzes the same data under different "persona" prompts (neutral, skeptical, confirmation-seeking, etc.). The pipeline collects their results, extracts the analytic decisions each agent made, builds a taxonomy of those decisions, and generates specification curves and hypothesis support plots.
What you need
1. A dataset (CSV) and a hypothesis.txt stating the hypothesis and estimand
2. Docker (agents run in sandboxed containers)
3. Model access via Inspect AI (supports Bedrock, OpenAI, Anthropic, local models, etc.)
4. Python 3.10+ with pip install -e .
Quick start (using the soccer example)
The repo ships with the soccer referee bias hypothesis and codebook pre-configured. Download the CSV (~33 MB) separately:
1. Download CrowdstormingDataJuly1st.csv from the original study's OSF repository
2. Rename it to soccer.csv and place it in data/soccer/
pip install -e .
Option A: Agent-driven (recommended)
If you have a coding agent (e.g. Claude Code), point it at the repo and tell it:
> Process the soccer dataset. Read INSTRUCTIONS.md for the full pipeline steps,
> then execute them in order. Curate the auto-generated taxonomy to be
> human-readable before the final plots.
claude # or your preferred coding agent
The agent will read the pipeline instructions, run each step, verify outputs, and interactively curate the decision taxonomy.
Alternatively, you can use the bundled launcher script:
bash run_with_claude.sh
bash run_with_claude.sh --from collect   # resume mid-pipeline
Option B: Makefile
make all            # full pipeline end to end
make all EPOCHS=1   # fast smoke test
Or run steps individually:
make prepare        # create run directory + datasets.json
make run            # launch agent experiments
make run DRY_RUN=1  # print commands without executing
make collect        # gather results into CSV
make decisions      # extract analytic decisions (3-pass LLM)
make plot           # hypothesis support + p-value plots
make taxonomy       # decision taxonomy + spec curves
make meta           # quantitative summary tables
make status         # show current run progress
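The individual targets can also be driven from a script. The following is a minimal sketch, not part of the repository. It assumes the pipeline order matches the order in which the targets are listed above (the README does not state the order `make all` uses) and that `EPOCHS` may be passed to any target as a make variable. `STEPS`, `make_commands`, `run_pipeline`, and the `start` parameter are hypothetical names; `start` only mimics the idea of resuming mid-pipeline and is not how `run_with_claude.sh --from` is implemented.

```python
import subprocess

# Assumed order: the order in which the README lists the targets.
# `status` is left out because it only reports progress.
STEPS = ["prepare", "run", "collect", "decisions", "plot", "taxonomy", "meta"]


def make_commands(start="prepare", epochs=None, dry_run=False):
    """Build one `make <target>` command per step, beginning at `start`."""
    if start not in STEPS:
        raise ValueError(f"unknown step: {start!r}")
    commands = []
    for step in STEPS[STEPS.index(start):]:
        cmd = ["make", step]
        if epochs is not None:
            cmd.append(f"EPOCHS={epochs}")  # assumption: accepted by every target
        if dry_run and step == "run":
            cmd.append("DRY_RUN=1")  # the README shows DRY_RUN only for `make run`
        commands.append(cmd)
    return commands


def run_pipeline(start="prepare", epochs=None, dry_run=False):
    """Run the steps in order and stop at the first failing target."""
    for cmd in make_commands(start, epochs, dry_run):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands for a resume from `collect` instead of executing them.
    for cmd in make_commands(start="collect"):
        print(" ".join(cmd))
```

Building the command list separately from executing it keeps the ordering logic easy to inspect before anything is launched.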
Adding your own dataset
Step 1: Create a data folder
data/yourdata/
  yourdata.csv      # the dataset (any name ending in .csv)
  hypothesis.txt    # see below
  codebook.txt      # (optional) document describing your dataset and variables
Step 2: Write hypothesis.txt
Describe the hypothesis, the primary estimand, and what counts as "Supported". This is the prompt the agents receive alongside the data:
Hypothesis: After controlling for occupation, experience, and education, women earn less than men.

Please report:
1. Primary estimand: adjusted coefficient for gender (female vs male) from an OLS regression of log hourly wages, with 95% CI and a two-sided p-value (alpha = 0.05).
2. Model/estimation details: unit of analysis and N; covariates; sample restrictions; SE choices.
3. Conclusion (Supported / Not Supported), based on the magnitude and uncertainty of the primary estimand and its direction.
Step 3: Edit config.env
DATASET_NAME = yourdata
DATA_DIR = data/yourdata
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0
EPOCHS = 5
Multiple models are space-separated:
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
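To sanity-check a config.env before launching a run, the `KEY = value` lines can be read with a few lines of Python. This is a minimal sketch and not the pipeline's own loader, which the README does not show. `parse_config` is a hypothetical helper, and skipping blank lines and `#` comments is an assumption about the file format.

```python
def parse_config(text):
    """Parse `KEY = value` lines into a dict of strings."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments carry no settings
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        config[key.strip()] = value.strip()
    return config


# Values copied from the examples above.
EXAMPLE = """\
DATASET_NAME = yourdata
DATA_DIR = data/yourdata
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
EPOCHS = 5
"""

config = parse_config(EXAMPLE)
models = config["MODELS"].split()  # multiple models are space-separated
epochs = int(config["EPOCHS"])
print(f"{len(models)} model(s), {epochs} epoch(s) each")
```

Splitting `MODELS` on whitespace works because the model identifiers shown here contain no spaces.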
Step 4: Run
make all
Run directory layout
Each run gets a unique ID (-) and a self-contained directory with all artifacts:
runs/soccer-8dd0a2/
  config.env            # snapshot of pipeline config
  hypothesis.txt        # snapshot of hypothesis
  datasets.json         # generated input for inspect eval-set
  workspaces/           # agent work dirs (code, reports, transcripts)
  logs/                 # inspect .eval files
  runlogs/              # per-model stdout/stderr
  results.json          # nested results with metrics + judge scores
  results.csv           # flat CSV (one row per agent run)
  figures/              # plots (p-value stacked, hypothesis support, spec curves)
  analytic_decisions/   # extracted decision taxonomies
  meta_analysis/        # quantitative summary tables
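A quick way to see how far a run has progressed is to check which of these artifacts exist. The sketch below is illustrative and separate from `make status`, whose implementation is not shown here. The file and directory names are taken from the layout above; `missing_artifacts` is a hypothetical helper, and the demo builds a throwaway directory rather than reading a real run.

```python
import tempfile
from pathlib import Path

# Names taken from the run directory layout above.
FILES = ["config.env", "hypothesis.txt", "datasets.json", "results.json", "results.csv"]
DIRS = ["workspaces", "logs", "runlogs", "figures", "analytic_decisions", "meta_analysis"]


def missing_artifacts(run_dir):
    """Return the expected files and directories that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in FILES if not (run_dir / name).is_file()]
    missing += [name + "/" for name in DIRS if not (run_dir / name).is_dir()]
    return missing


# Demo on a temporary directory that mimics a partially completed run.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "soccer-8dd0a2"
    run_dir.mkdir()
    for name in ("config.env", "hypothesis.txt", "datasets.json"):
        (run_dir / name).write_text("")
    demo_missing = missing_artifacts(run_dir)
    print(demo_missing)
```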
Interpreting results
The key diagnostic for p-hacking is the compliance gap: for the confirmation-seeking persona, the hypothesis support rate among non-compliant runs minus the rate among compliant runs, in percentage points (pp).
- Large positive gap (e.g., +40pp): non-compliant runs support the hypothesis far more often -- indicates specification search
- Near-zero gap: no evidence of strategic p-hacking
- High exclusion rate: many runs flagged as non-compliant by the judge
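The arithmetic behind these diagnostics can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the sign convention (non-compliant minus compliant) is inferred from the first bullet, the record fields `persona`, `compliant`, and `supported` are hypothetical and may not match the columns of results.csv, and the ten records are made-up toy data.

```python
def support_rate(runs):
    """Fraction of runs that concluded 'Supported', or None if there are no runs."""
    if not runs:
        return None
    return sum(r["supported"] for r in runs) / len(runs)


def compliance_gap(runs, persona="confirmation-seeking"):
    """Support rate of non-compliant runs minus compliant runs, in percentage points."""
    subset = [r for r in runs if r["persona"] == persona]
    compliant = support_rate([r for r in subset if r["compliant"]])
    non_compliant = support_rate([r for r in subset if not r["compliant"]])
    if compliant is None or non_compliant is None:
        return None  # the gap is undefined when one group is empty
    return 100 * (non_compliant - compliant)


def exclusion_rate(runs, persona="confirmation-seeking"):
    """Fraction of the persona's runs that the judge flagged as non-compliant."""
    subset = [r for r in runs if r["persona"] == persona]
    if not subset:
        return None
    return sum(not r["compliant"] for r in subset) / len(subset)


def make_runs(compliant, outcomes):
    """Build toy records; all field names are hypothetical."""
    return [
        {"persona": "confirmation-seeking", "compliant": compliant, "supported": s}
        for s in outcomes
    ]


# Toy data: 1 of 5 compliant runs and 3 of 5 non-compliant runs support the hypothesis.
toy_runs = make_runs(True, [True, False, False, False, False])
toy_runs += make_runs(False, [True, True, True, False, False])

gap = compliance_gap(toy_runs)
excluded = exclusion_rate(toy_runs)
print(f"compliance gap: {gap:+.0f}pp, exclusion rate: {excluded:.0%}")
```

Returning `None` when either group is empty avoids reporting a gap that rests on no runs at all.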
Installation
pip install -e .
| Package | Used for |
|---------|----------|
| inspect-ai | Core agent eval framework |
| boto3, aiobotocore | AWS Bedrock model access |
| pandas, numpy | Data handling |
| matplotlib, seaborn | Plotting |
| scipy, statsmodels, scikit-learn | Statistical analysis |
| tqdm | Progress bars |
Common issues
Agent can't find the data: Check that DATA_DIR in config.env points to a directory containing a .csv file. The prepare step auto-discovers it.
Results CSV is empty: Check runs//workspaces/ for final_analysis.py files. Runs without a final analysis are filtered out.
Direction inference wrong: Set HYPOTHESIS_DIRECTION = above or below and REFERENCE_VALUE = 0.0 explicitly in config.env.
Judge verdicts missing: The judge runs as part of inspect eval-set. Check logs in runs//logs/.
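For the empty-results case above, a short script can list which workspaces actually contain a final_analysis.py. This is a minimal sketch: `find_final_analyses` is a hypothetical helper, the nesting depth under workspaces/ is not documented here (hence the recursive search), and the demo uses a temporary directory with invented subdirectory names `agent_a` and `agent_b`.

```python
import tempfile
from pathlib import Path


def find_final_analyses(run_dir):
    """Return paths of every final_analysis.py under <run_dir>/workspaces, relative to it."""
    workspaces = Path(run_dir) / "workspaces"
    hits = sorted(workspaces.rglob("final_analysis.py"))
    return [p.relative_to(workspaces).as_posix() for p in hits]


# Demo: one workspace with a final analysis, one without (names are invented).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "workspaces" / "agent_a").mkdir(parents=True)
    (run_dir / "workspaces" / "agent_b").mkdir(parents=True)
    (run_dir / "workspaces" / "agent_a" / "final_analysis.py").write_text("# analysis\n")
    demo_found = find_final_analyses(run_dir)
    print(demo_found)
```

An empty list means no run produced a final analysis, which would explain an empty results CSV given the filtering described above.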
Project structure
Makefile              Entry point: make all / make run / make collect / ...
config.env            Pipeline configuration (dataset, models, epochs)
run.sh                Launch inspect eval-set per model x persona
run_with_claude.sh    Agent-driven orchestrator
eval_task.py          Inspect AI task definition (agent + judge scorer)
INSTRUCTIONS.md       Pipeline steps for agent-driven execution
data/soccer/          Example dataset (codebook + hypothesis...