# Fit-archive HDF5 schema (schema_version 2)

On-disk layout for the fit-results archive written by `Project.save_fits()`
and read by `FitResults.load()` / `Project.load_fits()`. The object model
([utils/fit_io.py](../../src/trspecfit/utils/fit_io.py)) is the source of
truth; this document specifies the 1:1 mapping to HDF5 so the writer and
reader agree on dtypes, attr keys, and None-handling.

For the design rationale (why per-slot `observed`, why two identity keys,
why HDF5 instead of pickle, etc.), see the archived design plan,
[Fit Results Save/Load](archive/fit_results_save_load_plan.md).
This file is the wire format.

## Conventions

- **Group-path components are positional, zero-padded six-digit keys**
  (`000000`, `000001`, ...). HDF5 path components forbid `/`, and
  user-meaningful names (`File.name`, `model_name`, etc.) can contain
  arbitrary characters. Identity lives in attrs, never in path segments.
- **Strings in attrs and string-typed dataset fields use
  `h5py.string_dtype(encoding="utf-8")`** (variable-length UTF-8). Fixed-
  length string types are not used; readers must not assume any length.
- **`None` handling**:
  - For optional **strings** (e.g. `yaml_filename`): omit the attr
    entirely. The reader treats absence as `None`.
  - For optional **integer-pair lists** (`e_lim`, `t_lim`): omit the attr
    entirely (do not write a sentinel array).
  - For optional **floats** like `stderr` inside structured arrays: write
    `np.nan`. The reader maps NaN back to `None` only for fields where the
    object model permits `None` (`stderr`); other float fields are kept as
    floats.
  - For optional **strings** inside structured arrays (long-form params
    `expr`): write `""`. The reader maps `""` back to `None` on
    columns where the object model permits `None` (lmfit's `expr=None`).
  - These slot-specific mappings (`NaN` ↔ `None`, `""` ↔ `None`) are
    applied by the slot reader, not the generic DataFrame decoder.
    `conf_ci`, `mcmc/flatchain`, `mcmc/ci`, and sbs `params` carry no
    `None` semantics; their literal `""` / `NaN` values are data.
  - For optional **groups/datasets** (`conf_ci`, `mcmc/`): omit the
    group/dataset. The reader treats absence as `None`.
- **Float dtype**: structured-array float fields and metric attrs are
  `float64`. The five user-array datasets (file `data` / `energy` /
  `time` and slot `observed` / `fit`) are written in **the source
  array's native dtype** (typically `float64` or `float32`). This
  preserves byte-for-byte equivalence with the inputs, which the
  fingerprints (`data_sha256`, `energy_sha256`, `time_sha256`) and
  `observed_sha256` rely on. The reader does not re-cast.
- **Integer dtype**: positional/index attrs and shape are `int64`.
- **Bool**: `vary` in params is numpy `bool` (`?`), which h5py stores as
  an HDF5 enum.
- **Tuple-valued attrs** (`shape`): stored as 1D `int64` arrays.
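
The positional-key and omit-when-`None` rules above can be sketched in a few lines. The helper names (`group_key`, `write_optional_attrs`, `read_optional_attr`) are hypothetical, and a plain `dict` stands in for an HDF5 attrs mapping; this illustrates the conventions and is not the writer in `utils/fit_io.py`.

```python
def group_key(index: int) -> str:
    """Positional, zero-padded six-digit group key (hypothetical helper)."""
    if not 0 <= index <= 999_999:
        raise ValueError("index does not fit in six digits")
    return f"{index:06d}"


def write_optional_attrs(attrs: dict, **optional) -> None:
    """Write only non-None values; None means 'omit the attr entirely'."""
    for key, value in optional.items():
        if value is not None:
            attrs[key] = value


def read_optional_attr(attrs: dict, key: str):
    """Absence on disk maps back to None."""
    return attrs.get(key)


attrs = {}  # stand-in for an HDF5 attrs mapping
write_optional_attrs(attrs, yaml_filename=None, e_lim=[10, 200], t_lim=None)
```

Here only `e_lim` ends up in `attrs`; the two `None` values leave no trace, and reading them back yields `None` again.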

### DataFrame encoding

All persisted `pd.DataFrame` payloads (slot `params`, `conf_ci`,
`mcmc/flatchain`, `mcmc/ci`) follow one uniform rule so the writer/reader
has a single code path and column labels never collide with HDF5 field-
name restrictions:

1. **All-numeric DataFrames** (homogeneous `float64` columns, e.g. sbs
   `params`, `flatchain`): 2D `float64` dataset of shape
   `(n_rows, n_cols)`, plus attr `columns` — a 1D vlen-utf8 array of
   length `n_cols` listing the column labels in axis-1 order.

2. **Heterogeneous-dtype DataFrames** (e.g. baseline/spectrum/2d
   `params`, `conf_ci`, `mcmc/ci`): 1D structured dataset of shape
   `(n_rows,)` with fields named positionally `c000000`, `c000001`, ...
   (zero-padded six-digit, matching the group-key convention). Each
   field's dtype is chosen per-column from `{vlen str, float64, bool}`.
   Attr `columns` (1D vlen-utf8 array, length = field count) gives the
   actual column labels in field order. Attr `dtypes` (1D vlen-utf8
   array, same length) gives a short type tag per column from
   `{"str", "float64", "bool"}` so the reader can rebuild the DataFrame
   without inferring dtypes back.

This convention isolates HDF5 from arbitrary user-facing labels (e.g.
sigma columns like `"+1"`, `"best fit"`, or future column-renames in
`par_to_df`) without giving up structured-array benefits for mixed
dtypes.
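
As a rough illustration of the two rules, the sketch below encodes a column mapping with NumPy only. It makes several assumptions: `encode_table` is a hypothetical helper, a dict of equal-length, non-empty lists stands in for a `pd.DataFrame`, and NumPy's `object` dtype stands in for the variable-length UTF-8 string dtype that the real writer would use.

```python
import numpy as np


def encode_table(columns: dict):
    """Encode a column mapping per the two rules; returns (array, attrs)."""
    labels = list(columns)
    tags = []
    for values in columns.values():
        if all(isinstance(v, bool) for v in values):
            tags.append("bool")
        elif all(isinstance(v, str) for v in values):
            tags.append("str")
        else:
            tags.append("float64")

    if all(tag == "float64" for tag in tags):
        # Rule 1: all-numeric -> 2D float64 array of shape (n_rows, n_cols).
        data = np.array([columns[k] for k in labels], dtype=np.float64).T
        return data, {"columns": labels}

    # Rule 2: heterogeneous -> 1D structured array with positional fields.
    np_types = {"str": object, "float64": np.float64, "bool": np.bool_}
    dtype = np.dtype([(f"c{i:06d}", np_types[t]) for i, t in enumerate(tags)])
    n_rows = len(next(iter(columns.values())))
    data = np.empty(n_rows, dtype=dtype)
    for i, label in enumerate(labels):
        data[f"c{i:06d}"] = columns[label]
    return data, {"columns": labels, "dtypes": tags}


wide, wide_attrs = encode_table({"A": [1.0, 2.0], "tau": [0.5, 0.6]})
mixed, mixed_attrs = encode_table(
    {"name": ["A", "tau"], "value": [1.0, 0.5], "vary": [True, False]}
)
```

The user-facing labels only ever appear in the `columns` list, so a label such as `"best fit"` never has to be a valid field name.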

## Top-level layout

```
<archive>.fit.h5
├── metadata                                # group; identity attrs only
│   attrs:
│     trspecfit_version  : str              # e.g. "0.4.0"; updated on every write
│     project_name       : str              # Project.name; set on first write
│     timestamp_created  : str              # ISO 8601 UTC, first write
│     timestamp_updated  : str              # ISO 8601 UTC, most recent write
│     schema_version     : str              # "2"; bump on incompatible change
└── files/                                  # group; one subgroup per file
    ├── 000000/                             # SavedFile (see "File group")
    └── 000001/...
```

`save_fits` is slot-scoped (a single archive may be written multiple
times as new fits accumulate), so the archive carries both
`timestamp_created` (set once when the file is first opened with mode
`"w"`) and `timestamp_updated` (rewritten on every save). The writer
must not recreate the archive on subsequent saves unless the caller
explicitly asks for that; the canonical way to start fresh is to choose
a new path.
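
A minimal sketch of the created/updated bookkeeping, assuming a hypothetical helper `touch_archive_metadata` that is called on every save. It is written against a generic mutable mapping so it can be exercised with a plain `dict`; whether the real writer structures this the same way is not stated here.

```python
from collections.abc import MutableMapping
from datetime import datetime, timezone
from typing import Optional


def touch_archive_metadata(attrs: MutableMapping, project_name: str,
                           version: str, now: Optional[str] = None) -> None:
    """Set write-once attrs on the first save, refresh the rest every save."""
    stamp = now if now is not None else datetime.now(timezone.utc).isoformat()
    if "timestamp_created" not in attrs:
        # First write only: these values must survive later saves.
        attrs["project_name"] = project_name
        attrs["schema_version"] = "2"
        attrs["timestamp_created"] = stamp
    attrs["trspecfit_version"] = version
    attrs["timestamp_updated"] = stamp


archive_attrs: dict = {}
touch_archive_metadata(archive_attrs, "demo", "0.4.0",
                       now="2024-01-01T00:00:00+00:00")
touch_archive_metadata(archive_attrs, "demo", "0.4.1",
                       now="2024-02-01T00:00:00+00:00")
```

After the second call, `timestamp_created` still holds the January value while `timestamp_updated` and `trspecfit_version` reflect the latest save.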

`schema_version` is currently `"2"`. It was bumped from `"1"` before this
branch shipped, when the σ-calibrated chi-square columns and per-slot sigma
metadata changed the stored fields — a clean break, so archives written by
the older schema can no longer be read. Future incompatible changes (e.g.
project-scoped joint-result slots or `keep_history=True` full-log save —
both deferred, see "What's *not* in v1") bump it again. The reader rejects
archives with a `schema_version` it does not recognize. (This wire-format
number is independent of the feature-scope "v1" used elsewhere in this doc.)
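
The rejection rule might look like the following. `SUPPORTED_SCHEMA_VERSIONS` and `check_schema_version` are hypothetical names, and the exception type is an assumption; only the behavior (refuse unknown versions) comes from this document.

```python
SUPPORTED_SCHEMA_VERSIONS = {"2"}  # hypothetical constant


def check_schema_version(attrs) -> str:
    """Return the archive's schema_version, or raise if it is not recognized."""
    version = attrs.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported fit-archive schema_version: {version!r}")
    return version
```

Note that the comparison is on strings, matching the `str` type of the attr; an archive carrying `"1"` or no version at all is refused.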

## File group

```
files/000000/
├── metadata                                # group, no datasets; carries identity attrs
│   attrs:
│     name           : str                  # File.name
│     original_path  : str                  # absolute path of source file at save time
│     dim            : int64                # 1 or 2
│     shape          : int64[ndim]          # data.shape as 1D array
│     data_sha256    : str                  # 64 hex chars
│     energy_sha256  : str                  # 64 hex chars
│     time_sha256    : str                  # 64 hex chars; "" for 1D files
│     e_lim          : int64[2]   (opt)     # [start, stop) index slice; omit if None
│     t_lim          : int64[2]   (opt)     # [start, stop) index slice; omit if None
├── energy                                  # 1D dataset; preserves source dtype
├── time                                    # 1D dataset; length 0 if 1D file; preserves source dtype
├── data                                    # 1D (1D file) or 2D (n_t, n_e) dataset; preserves source dtype
└── slots/
    ├── 000000/                             # SavedFitSlot (see "Slot group")
    └── 000001/...
```

Notes:

- The full data + axes are duplicated into the archive deliberately
  (decision in the archived design plan — "Self-contained archive"). On load, the reader
  hands these back via `SavedFile`; the live `Project` is not mutated.
- `data_sha256`, `energy_sha256`, `time_sha256` together with `shape`
  form the `file_fingerprint` used to match an archive's file to a
  `Project.files[*]` (or to another archive). See
  `compute_file_fingerprint` in `utils/fit_io.py`.
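
To illustrate why the datasets keep their native dtype, here is a sketch of a byte-level fingerprint. It assumes each digest is SHA-256 over the array's raw C-order bytes; the actual `compute_file_fingerprint` may hash differently, and `sha256_of_array` / `sketch_fingerprint` are hypothetical names.

```python
import hashlib

import numpy as np


def sha256_of_array(arr: np.ndarray) -> str:
    """Hex digest of the array's raw bytes in C order (assumed hashing rule)."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()


def sketch_fingerprint(data, energy, time=None) -> dict:
    no_time = time is None or time.size == 0
    return {
        "shape": tuple(int(n) for n in data.shape),
        "data_sha256": sha256_of_array(data),
        "energy_sha256": sha256_of_array(energy),
        # The schema stores "" for 1D files, which have no time axis.
        "time_sha256": "" if no_time else sha256_of_array(time),
    }


energy = np.linspace(80.0, 90.0, 5, dtype=np.float32)
data = np.ones(5, dtype=np.float32)
fp_native = sketch_fingerprint(data, energy)
fp_recast = sketch_fingerprint(data.astype(np.float64), energy)
```

Under this assumption, re-casting `data` from `float32` to `float64` changes its bytes and therefore its digest, even though the values are numerically equal.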

### Identity collisions

Two distinct rules apply, in two different directions:

- **Archive uniqueness (write side).** A file group's effective identity
  is `(file_fingerprint, name, original_path)`. Two source files with
  byte-identical `data` / `energy` / `time` but different `name` or
  `original_path` are stored in **separate** file groups. Files agreeing
  on all three are treated as the same file (one group, slots merge).
  This means the writer's "find existing file group" lookup
  (`_find_file_by_fingerprint`) must compare `name` / `original_path`
  in addition to fingerprint when more than one candidate matches.

- **Live-Project matching (read side).** When a `FitResults` archive is
  loaded and the caller wants to align archive files with
  `Project.files[*]`, fingerprint is the primary key, and `name` /
  `original_path` are tie-breakers if multiple candidates match. The
  loader does not require an exact `original_path` match — that path is
  baked at save time and may not exist on the loading machine.

The asymmetry is deliberate: at write time we want strict separation of
intentionally-distinct files; at read time we want forgiving matching
that survives copying the archive between machines.
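
The two directions can be contrasted in a short sketch. Everything here is hypothetical (`FileIdentity`, `find_for_write`, `match_for_read`), a single string stands in for the multi-sha fingerprint, and falling back to the first remaining candidate is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FileIdentity:
    fingerprint: str      # stand-in for the multi-sha file_fingerprint
    name: str
    original_path: str


def find_for_write(groups: Sequence[FileIdentity],
                   new: FileIdentity) -> Optional[int]:
    """Write side: reuse a group only if all three components agree."""
    for index, existing in enumerate(groups):
        if existing == new:
            return index
    return None


def match_for_read(live: Sequence[FileIdentity],
                   saved: FileIdentity) -> Optional[int]:
    """Read side: fingerprint first; name, then path, only break ties."""
    candidates = [i for i, f in enumerate(live)
                  if f.fingerprint == saved.fingerprint]
    if len(candidates) <= 1:
        return candidates[0] if candidates else None
    for key in ("name", "original_path"):
        narrowed = [i for i in candidates
                    if getattr(live[i], key) == getattr(saved, key)]
        if narrowed:
            candidates = narrowed
    return candidates[0]  # arbitrary pick if still ambiguous


saved = FileIdentity("abc", "scan1", "/home/alice/scan1.h5")
live = [
    FileIdentity("abc", "scan1_copy", "/mnt/b/scan1_copy.h5"),
    FileIdentity("abc", "scan1", "/mnt/b/scan1.h5"),
]
```

With these values, `find_for_write(live, saved)` finds no group to reuse because the path differs, while `match_for_read(live, saved)` still resolves to the second entry via the name tie-breaker.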

## Slot group

```
files/000000/slots/000000/
├── metadata                                # group; identity + provenance + (non-sbs) metrics in attrs
│   attrs:
│     # --- identity ---
│     file_ref          : str               # "files/000000" (archive-local)
│     model_name        : str               # SavedFitSlot.model_name
│     fit_type          : str               # "baseline" | "spectrum" | "sbs" | "2d"
│     selection_json    : str               # SavedFitSlot.selection_json
│     archive_slot_key  : str               # sha256(file_ref|model_name|fit_type|selection_json)
│     history_key       : str               # in-memory key from save time; non-authoritative
│     observed_sha256   : str               # 64 hex chars
│     # --- provenance ---
│     fit_alg           : str               # e.g. "leastsq", "Nelder"
│     yaml_filename     : str        (opt)  # human breadcrumb; omit if None
│     timestamp         : str               # ISO 8601 UTC, slot creation time
│     # --- metrics (baseline / spectrum / 2d only) ---
│     chi2              : float64   (cond)
│     chi2_red          : float64   (cond)
│     r2                : float64   (cond)
│     aic               : float64   (cond)
│     bic               : float64   (cond)
├── params                                  # see "params dataset" below; layout depends on fit_type
├── observed                                # 1D or 2D dataset; preserves source dtype
├── fit                                     # 1D or 2D dataset; preserves source dtype; observed.shape == fit.shape
├── metrics_per_slice                (opt)  # 1D structured dataset; sbs only
├── conf_ci                          (opt)  # heterogeneous-DataFrame dataset; see "conf_ci dataset"
└── mcmc/                            (opt)  # see "mcmc group"
```

`(cond)` = present iff `fit_type != "sbs"`. For sbs slots, metrics live
in the `metrics_per_slice` dataset because they are per-slice arrays,
not scalars.

`(opt)` = present iff the corresponding `SavedFitSlot` field is non-`None`
(`conf_ci`, `mcmc`) or applicable to the fit type
(`metrics_per_slice` is sbs-only).

### `archive_slot_key` vs `history_key`

The authoritative on-disk slot key is `archive_slot_key`, computed at
save time once the file's archive position is known:

```
archive_slot_key = sha256(file_ref | model_name | fit_type | selection_json)
```

`archive_slot_key` and `history_key` serve the same logical purpose
(uniquely identifying a slot); they use different file-identity tokens
because in-memory and on-disk identity primitives differ (multi-sha
fingerprint vs archive-local positional path). `archive_slot_key` is
what the writer's slot-scoped overwrite check
(`_find_slot_by_archive_key`) compares against.

`history_key` is also persisted as a non-authoritative attr (a debugging
aid for archive inspection and round-trip tests), but the reader
**recomputes** it from
`(file_fingerprint, model_name, fit_type, selection_json)` and uses the
recomputed value for the `SavedFitSlot`. The on-disk value is ignored
on read; it exists only so an external inspector (e.g. a notebook
poking at the HDF5 directly) can correlate slots to in-session history
without redoing the hash.
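
A sketch of the `archive_slot_key` computation follows. The formula above fixes the four components and their order; the literal `"|"` separator and UTF-8 encoding used here are assumptions, as are the function name, the model name, and the selection contents.

```python
import hashlib
import json


def sketch_archive_slot_key(file_ref: str, model_name: str,
                            fit_type: str, selection_json: str) -> str:
    # Assumption: components are joined with a literal "|" and hashed as UTF-8.
    token = "|".join([file_ref, model_name, fit_type, selection_json])
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# Hypothetical selection payload, serialized once and reused as-is.
selection_json = json.dumps({"e_lim": [10, 200], "t_lim": None}, sort_keys=True)
key_a = sketch_archive_slot_key("files/000000", "GLP_model", "2d", selection_json)
key_b = sketch_archive_slot_key("files/000001", "GLP_model", "2d", selection_json)
```

Because `file_ref` is archive-local, the same model and selection under a different file group yield a different key, which is what lets the overwrite check stay slot-scoped.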

## `params` dataset

Two distinct shapes depending on `fit_type`, both following the
DataFrame-encoding rule from "Conventions".

### baseline / spectrum / 2d — long format (one row per parameter)

Heterogeneous-dtype DataFrame:

```
params : 1D structured dataset, shape (n_par,)
  fields (positional, in column order):
    c000000 : vlen str    # column "name"        (parameter name, e.g. "GLP_01_A")
    c000001 : float64     # column "value"
    c000002 : float64     # column "stderr"      (NaN ↔ lmfit returned None)
    c000003 : float64     # column "init_value"
    c000004 : float64     # column "min"         (-inf permitted)
    c000005 : float64     # column "max"         (+inf permitted)
    c000006 : bool        # column "vary"
    c000007 : vlen str    # column "expr"        ("" ↔ None)
  attrs:
    columns : vlen str[8] = ["name","value","stderr","init_value","min","max","vary","expr"]
    dtypes  : vlen str[8] = ["str","float64","float64","float64","float64","float64","bool","str"]
```

Mirrors the DataFrame returned by `par_to_df(..., col_type="min")` in
`utils/lmfit.py`. `stderr` is the only float column that legitimately
holds `NaN`-as-`None` — the others must always have a real value.
`min`/`max` may carry IEEE `-inf`/`+inf` (unbounded parameters); those
are written verbatim.
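
The slot-specific `None` mapping for long-format rows could be sketched as below. The names `NONE_AS_NAN`, `NONE_AS_EMPTY`, and `decode_param_row` are hypothetical, and a `dict` stands in for one decoded row; the point is only that the mapping is restricted to `stderr` and `expr`.

```python
import math

NONE_AS_NAN = {"stderr"}   # float columns where NaN means None
NONE_AS_EMPTY = {"expr"}   # string columns where "" means None


def decode_param_row(row: dict) -> dict:
    """Map on-disk sentinels back to None, only where the model allows it."""
    out = {}
    for column, value in row.items():
        if column in NONE_AS_NAN and isinstance(value, float) and math.isnan(value):
            out[column] = None
        elif column in NONE_AS_EMPTY and value == "":
            out[column] = None
        else:
            out[column] = value
    return out


stored = {"name": "GLP_01_A", "value": 1.2, "stderr": math.nan,
          "init_value": 1.0, "min": -math.inf, "max": math.inf,
          "vary": True, "expr": ""}
decoded = decode_param_row(stored)
```

The infinite bounds pass through untouched, while `stderr` and `expr` come back as `None`.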

### sbs — wide format (one row per slice, one column per parameter)

All-numeric DataFrame:

```
params : 2D float64 dataset, shape (n_slices, n_par)
  attrs:
    columns : vlen str[n_par]   # parameter names; axis-1 order
```

Stores optimized values only — no init / stderr / min / max / vary /
expr. Mirrors `list_of_par_to_df(results)` in `utils/lmfit.py`. If full
per-slice metadata becomes useful later, add a sibling
heterogeneous-DataFrame dataset; do not redefine `params`.

## `metrics_per_slice` dataset (sbs only)

```
metrics_per_slice : 1D structured dataset, shape (n_slices,)
  dtype:
    chi2     : float64
    chi2_red : float64
    r2       : float64
    aic      : float64
    bic      : float64
```

Row order follows the time-slice order in `observed` axis 0. The reader
reconstructs `SavedFitSlot.metrics` as `{name: column_array}` for sbs.
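
For illustration, the `{name: column_array}` reconstruction can be done directly on a NumPy structured array with the dtype above. The metric values are made up, and the variable names are not taken from the reader.

```python
import numpy as np

METRIC_FIELDS = ("chi2", "chi2_red", "r2", "aic", "bic")
metrics_dtype = np.dtype([(name, np.float64) for name in METRIC_FIELDS])

# Two slices of made-up metric values, in time-slice order.
metrics_per_slice = np.array(
    [(10.0, 1.1, 0.98, -50.0, -45.0), (12.0, 1.3, 0.97, -48.0, -43.0)],
    dtype=metrics_dtype,
)

# One 1D array per metric; np.array(...) copies each field out of the record.
metrics = {name: np.array(metrics_per_slice[name]) for name in metrics_dtype.names}
```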

## `conf_ci` dataset (optional)

Heterogeneous-dtype DataFrame (one string column for the parameter
name, the rest float):

```
conf_ci : 1D structured dataset, shape (n_par,)
  fields (positional, in column order):
    c000000 : vlen str         # column "parameter" (or whatever conf_interval_to_df produced)
    c000001 : float64          # first sigma column, e.g. "-3"
    c000002 : float64          # next, e.g. "-2"
    ...
    c00000K : float64          # last, e.g. "+3"
  attrs:
    columns : vlen str[K+1]    # actual column labels (e.g. ["parameter","-3",...,"+3"])
    dtypes  : vlen str[K+1]    # ["str","float64","float64",...,"float64"]
```

Sigma labels come from `conf_interval_to_df` in `utils/lmfit.py`
(typically `["-3", "-2", "-1", "best fit", "+1", "+2", "+3"]`). The
positional fields insulate HDF5 from arbitrary user-facing labels; the
`columns` attr restores them on read. Omitted entirely if
`SavedFitSlot.conf_ci is None`.

## `mcmc/` group (optional)

```
mcmc/
├── flatchain                               # all-numeric DataFrame
│   2D float64 dataset, shape (n_samples, n_par)
│   attrs:
│     columns : vlen str[n_par]             # parameter labels; axis-1 order
├── ci                                (opt) # heterogeneous-dtype DataFrame
│   1D structured dataset, shape (n_par,)
│   field/attr layout identical to conf_ci above
└── attrs:
      lnsigma : float64                     # __lnsigma point estimate
```

If `SavedFitSlot.mcmc is None`, the entire `mcmc/` group is omitted.
Within the group:

- `flatchain` is required when `mcmc/` is present, but may be empty if
  emcee returned an empty chain.
- `ci` is optional (emcee CI may not have been computed).
- `lnsigma` is required when `mcmc/` is present.
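
The presence rules above amount to a small check. The sketch assumes the caller passes the member names of `mcmc/` and the names of its attrs as plain collections; `check_mcmc_group` is a hypothetical helper.

```python
def check_mcmc_group(members, attr_names) -> list:
    """Return a list of presence-rule violations (empty if the group is valid)."""
    problems = []
    if "flatchain" not in members:
        problems.append("mcmc/flatchain is required when mcmc/ is present")
    if "lnsigma" not in attr_names:
        problems.append("mcmc attr lnsigma is required when mcmc/ is present")
    # `ci` is optional, so its absence is never a violation.
    return problems
```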

## Reader → object-model mapping

Per slot, the reader produces a `SavedFitSlot` with:

| `SavedFitSlot` field | Source                                                         |
|----------------------|----------------------------------------------------------------|
| `file_fingerprint`   | parent file group's `metadata` attrs (`shape` + `*_sha256`)    |
| `file_name`          | parent file group's `metadata.name` attr                       |
| `model_name`         | slot `metadata.model_name` attr                                |
| `fit_type`           | slot `metadata.fit_type` attr                                  |
| `selection`          | `json.loads(metadata.selection_json)`                          |
| `selection_json`     | slot `metadata.selection_json` attr                            |
| `observed_sha256`    | slot `metadata.observed_sha256` attr                           |
| `history_key`        | recomputed from `(file_fingerprint, model_name, fit_type, selection_json)` |
| `params`             | `params` dataset (+ its `columns` attr) → DataFrame            |
| `metrics`            | scalar attrs (non-sbs) or `metrics_per_slice` (sbs) → dict     |
| `observed`           | `observed` dataset                                             |
| `fit`                | `fit` dataset                                                  |
| `fit_alg`            | slot `metadata.fit_alg` attr                                   |
| `yaml_filename`      | slot `metadata.yaml_filename` attr (None if absent)            |
| `timestamp`          | slot `metadata.timestamp` attr                                 |
| `conf_ci`            | `conf_ci` dataset → DataFrame, or `None` if absent             |
| `mcmc`               | `mcmc/` group → dict, or `None` if absent                      |

`history_key` is persisted as a non-authoritative attr but recomputed
by the reader (see "`archive_slot_key` vs `history_key`"). The on-disk
value is for debugging and external inspection only; the in-memory key
on the returned `SavedFitSlot` always comes from the live recompute.

## Per-fit-type cheat sheet

| fit_type   | `observed.shape`        | `params` layout                      | metrics location          | sbs-only datasets     | t_lim applied |
|------------|-------------------------|--------------------------------------|---------------------------|-----------------------|---------------|
| baseline   | `(n_e_view,)`           | structured (long, named columns)     | scalar attrs              | —                     | n/a           |
| spectrum   | `(n_e_view,)`           | structured (long, named columns)     | scalar attrs              | —                     | n/a           |
| sbs        | `(n_t_full, n_e_view)`  | 2D float64 + `columns` attr (wide)   | `metrics_per_slice`       | `metrics_per_slice`   | **no**        |
| 2d         | `(n_t_view, n_e_view)`  | structured (long, named columns)     | scalar attrs              | —                     | yes           |

`n_e_view` denotes the energy axis cropped by `e_lim`; `n_t_view`
denotes the time axis cropped by `t_lim`. `n_t_full` is the file's full
time-axis length: `fit_slice_by_slice` iterates every slice in
`File.data` regardless of `t_lim`, so `selection.t_lim` is always
`None` for sbs slots ([trspecfit.py:2987](../../src/trspecfit/trspecfit.py#L2987)).
`spectrum` and `baseline` reduce time via `time_point` / `time_range`
or `base_t_ind`, captured separately in `selection`.
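
The shape column of the cheat sheet can be expressed as a small function. This is a sketch with hypothetical names; it assumes `e_lim` / `t_lim` are `[start, stop)` index pairs that lie within the axis.

```python
def view_length(full_length: int, lim) -> int:
    """Length of an axis after an optional [start, stop) index crop."""
    if lim is None:
        return full_length
    start, stop = lim
    return stop - start


def expected_observed_shape(fit_type, n_t, n_e, e_lim=None, t_lim=None):
    n_e_view = view_length(n_e, e_lim)
    if fit_type in ("baseline", "spectrum"):
        return (n_e_view,)
    if fit_type == "sbs":
        return (n_t, n_e_view)  # t_lim is not applied to sbs
    if fit_type == "2d":
        return (view_length(n_t, t_lim), n_e_view)
    raise ValueError(f"unknown fit_type: {fit_type!r}")
```

For a file with 50 time points and 400 energy points, `e_lim=(10, 210)` and `t_lim=(5, 25)` give `(50, 200)` for sbs but `(20, 200)` for 2d.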

Project-side: `Project.fit_2d()` produces ordinary `fit_type="2d"`
slots, one per file ([trspecfit.py:1004-1009](../../src/trspecfit/trspecfit.py#L1004-L1009)).
The archive does not distinguish them from slots produced by
`File.fit_2d()`.

## What's *not* in v1

- **Project-scoped joint-result slots.** `Project.fit_2d()` runs a joint
  multi-file fit but currently emits one ordinary `fit_type="2d"` slot
  per file (each carrying that file's projection of the joint result).
  There is no archive construct for a single "joint" slot that owns the
  shared parameter values without per-file duplication. The pipeline
  that would justify one is flagged as architecturally unfinished
  ([TODO.md](https://github.com/InfinityMonkeyAtWork/time-resolved-spectroscopy-fit/blob/main/TODO.md)
  — "Project-level fit backend"). Adding a
  joint slot later is a strict additive change: a new top-level group
  (e.g. `project_slots/`) and a schema-version bump; existing per-file
  2d slots stay untouched.
- **`keep_history=True` full-log save.** The default `Project.save_fits`
  collapses to latest-per-`history_key`. Persisting every refit needs a
  timestamp/sequence component in the slot key; deferred to feature-scope
  v2 (not to be confused with `schema_version` `"2"`).
- **Model rehydration.** `yaml_filename` is a breadcrumb; v1 does not
  promise to deserialize a `Model` from the archive.
- **MCMC trace metadata** (acceptance fraction, autocorrelation times,
  etc.) — only `flatchain` / `ci` / `lnsigma` are persisted. If the
  decoupled-MCMC follow-on (the archived design plan, "Out of scope") lands, that work owns
  the schema extension.

## Cross-references

- Object model + identity helpers: [src/trspecfit/utils/fit_io.py](../../src/trspecfit/utils/fit_io.py)
- `FitResults` query API: [src/trspecfit/fit_results.py](../../src/trspecfit/fit_results.py)
- Eager extraction call sites: `_append_*_slot` in [src/trspecfit/trspecfit.py](../../src/trspecfit/trspecfit.py)
- DataFrame builders the schema mirrors: `par_to_df`, `list_of_par_to_df`,
  `conf_interval_to_df` in [src/trspecfit/utils/lmfit.py](../../src/trspecfit/utils/lmfit.py)
- Structural precedent for HDF5 layout: `Simulator.save_data` in
  [src/trspecfit/simulator.py](../../src/trspecfit/simulator.py)