Benchmark GIR vs Interpreter

Shared source of truth for benchmarking the compiled GIR evaluator against the interpreter (MCP) path.

Run benchmark_gir.py to compare the compiled and interpreter evaluation paths on an example fitting workflow.

Available examples

ls -d examples/fitting_workflows/0[0-9]_*/ 2>/dev/null | \
  grep -v _fits | \
  while read -r d; do printf '  %s\n' "$(basename "$d")"; done

Lowerability is checked per node by can_lower_2d(). There is no blanket exclusion for convolution or subcycle dynamics: both lower when their structural contracts are satisfied (convolution as a resolved-trace time-domain convolution, subcycle substeps compiled into schedule arrays). The examples exercise different GIR paths:

#	example	GIR path exercised
1	`01_basic_fitting`	convolution (`MonoExpPosIRF` -> `*CONV` kernel)
2	`02_dependent_parameters`	plain dynamics, no conv/subcycle/profile (default)
3	`03_multi_cycle`	subcycle dynamics
4	`04_par_profiles`	profile models
5	`05_project_level_fitting`	not currently supported by the benchmark harness

Example 02 is the default because it is the cleanest baseline comparison (pure dynamics, no side paths).

Task

Parse the arguments:

First positional integer -> --example N (default: 2)
--fit -> include full-fit benchmark
-n REPS -> fit repetitions (default: 3)

Run:

.venv/bin/python .claude/skills/benchmark/benchmark_gir.py --example <N> --calls 200 [--fit] [-n <REPS>]
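The argument mapping above can be sketched as a small translation function. This is only an illustration of the mapping, not code from benchmark_gir.py: the name build_command is hypothetical, and it assumes the user's arguments arrive as a single string. The -n flag is forwarded only when the user supplies it, so the script's own default applies otherwise.

```python
import shlex

PYTHON = ".venv/bin/python"
SCRIPT = ".claude/skills/benchmark/benchmark_gir.py"


def build_command(user_args: str) -> list[str]:
    """Translate the user's arguments into a benchmark_gir.py command line.

    Hypothetical helper: only the mapping itself comes from the text above.
    """
    tokens = shlex.split(user_args)
    example = 2          # default example
    example_seen = False
    fit = False
    reps = None          # None means "let the script use its default"

    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--fit":
            fit = True
        elif tok == "-n":
            i += 1
            if i >= len(tokens):
                raise ValueError("-n requires a value")
            reps = int(tokens[i])
        elif tok.isdigit() and not example_seen:
            # Only the first positional integer selects the example.
            example = int(tok)
            example_seen = True
        else:
            raise ValueError(f"unrecognised argument: {tok!r}")
        i += 1

    cmd = [PYTHON, SCRIPT, "--example", str(example), "--calls", "200"]
    if fit:
        cmd.append("--fit")
    if reps is not None:
        cmd += ["-n", str(reps)]
    return cmd
```

For instance, build_command("3 --fit -n 5") yields a command ending in --example 3 --calls 200 --fit -n 5, while an empty argument string falls back to example 2.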

Report the results to the user. Highlight the speedup ratio and the Max |diff| correctness check, and note which GIR path the example exercises (convolution / subcycle / profile / plain).
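As a rough illustration of the quantities to highlight, the sketch below computes a speedup ratio and a maximum absolute difference from two timings and two output vectors. It assumes the speedup is defined as interpreter time divided by GIR time, which the benchmark script may or may not do; the function name and output format are hypothetical.

```python
from typing import Sequence


def summarize(interp_s: float, gir_s: float,
              interp_out: Sequence[float], gir_out: Sequence[float],
              gir_path: str) -> str:
    """Format a one-line summary of a head-to-head run (hypothetical helper)."""
    if len(interp_out) != len(gir_out):
        raise ValueError("outputs differ in length")
    # Assumption: speedup = interpreter time / GIR time, so > 1 means GIR is faster.
    speedup = interp_s / gir_s
    # Largest element-wise absolute deviation between the two paths.
    max_diff = max((abs(a - b) for a, b in zip(interp_out, gir_out)), default=0.0)
    return f"path={gir_path}  speedup={speedup:.2f}x  max|diff|={max_diff:.3e}"
```

A large speedup with a Max |diff| that is not close to zero would suggest the two paths disagree, so the correctness figure is worth reporting alongside the ratio rather than after it.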

Fit-count and planning-cost modes

Two additional modes report operational characteristics of the fit rather than a head-to-head speedup:

--nfev: run the standard baseline + fit_2d pipeline and report the total number of residual evaluations per stage. Useful when checking whether a change increases the amount of fit work (not just the per-call cost).
--plan-time: measure build_graph + schedule_2d cost against the total fit_2d wall time. Useful for confirming that planning overhead stays negligible relative to the fit itself.

Both modes accept --example 0 to run across all examples and print a summary table at the end.

.venv/bin/python .claude/skills/benchmark/benchmark_gir.py --example <N> --nfev
.venv/bin/python .claude/skills/benchmark/benchmark_gir.py --example <N> --plan-time
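The two quantities these modes report can be illustrated generically. The sketch below is not taken from benchmark_gir.py: CountingResidual and planning_fraction are hypothetical names, and the residual is a toy stand-in. It shows one common way to count residual evaluations (wrapping the callable) and how a planning cost relates to total fit time.

```python
class CountingResidual:
    """Wrap a residual callable and count how often it is evaluated."""

    def __init__(self, residual):
        self.residual = residual
        self.nfev = 0  # number of residual evaluations so far

    def __call__(self, *args, **kwargs):
        self.nfev += 1
        return self.residual(*args, **kwargs)


def planning_fraction(plan_s: float, fit_s: float) -> float:
    """Share of the total fit wall time spent on planning.

    plan_s is assumed to be the combined build_graph + schedule_2d time,
    fit_s the total fit_2d wall time, both in seconds.
    """
    if fit_s <= 0:
        raise ValueError("fit_s must be positive")
    return plan_s / fit_s


if __name__ == "__main__":
    # Toy residual standing in for a real model evaluation.
    counted = CountingResidual(lambda p: p - 1.0)
    for guess in (0.0, 0.5, 1.0):
        counted(guess)
    print("nfev:", counted.nfev)
    print("planning fraction:", planning_fraction(0.02, 2.0))
```

A change that leaves per-call cost unchanged but raises nfev still slows the fit, which is why the evaluation count is tracked separately from the speedup ratio.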

Profiling (GIR path only)

For flamegraphs of the GIR hot path, use --profile to run a GIR-only loop (no interpreter path, no correctness check, no prints inside the loop) and attach py-spy to the subprocess.

Prerequisite (one-time):

.venv/bin/pip install -e ".[profiling]"

Invocation:

.venv/bin/py-spy record --rate 500 -o docs/design/benchmarks/gir_profile.svg -- \
  .venv/bin/python .claude/skills/benchmark/benchmark_gir.py --example <N> --profile

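py-spy needs permission to read the memory of the process it profiles. On Linux, launching the benchmark as a child of py-spy (as in the invocation above) typically works without root; if py-spy reports a permission error, either run it under sudo or relax the ptrace restriction with sysctl kernel.yama.ptrace_scope=0.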
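To see which ptrace policy is in effect before profiling, the Yama setting can be read from procfs. This is a minimal sketch, assuming a Linux system with the Yama security module enabled; the helper name is illustrative, and it returns None where the file is absent (for example on macOS or on kernels without Yama).

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the Yama ptrace policy on Linux.
PTRACE_SCOPE_PATH = "/proc/sys/kernel/yama/ptrace_scope"


def read_ptrace_scope(path: str = PTRACE_SCOPE_PATH) -> Optional[int]:
    """Return the current ptrace_scope value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not an integer.
        return None


if __name__ == "__main__":
    scope = read_ptrace_scope()
    if scope is None:
        print("ptrace_scope not available on this system")
    else:
        print(f"kernel.yama.ptrace_scope = {scope}")
```

A value of 0 places no Yama restriction on attaching to processes owned by the same user; higher values are progressively stricter.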