Decontaminated Endpoint Scaling Law (Compute Vs Final Loss)

This page uses the same Chinchilla-style 3-parameter fit as the main report: \(L_\infty(C) = E + A\,(C/10^{18})^{-\alpha}\). It is fit separately for each \((\textrm{mix}, \textrm{LR}, \textrm{val set})\) on the small ladder (3e18 → 3e20) with \(\texttt{scipy.optimize.curve\_fit}\), constrained to \(E < \min y\) and \(A,\alpha \ge 0\), where \(y\) denotes the fitted final losses. The code path is identical to the main report's (\(\texttt{fit\_floor\_power}\)). There are four val sets: the original (contaminated) 12,500-window math val, kept as the anchor, and three paranoid decon sets that drop every val doc with any verified train near-duplicate at Jaccard ≥ 0.90, 0.75, or 0.50. Losses come from the decon eval sweep (one v6e-4 job per checkpoint, with all four sets evaluated together in-harness). The 1e21 and 1e22 actuals are drawn as triangles and are never used by a fit whose cutoff lies below them. A 2-parameter log-log fit is available as a toggle.
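As a rough illustration of the bounded fit described above, the sketch below fits \(E + A\,(C/10^{18})^{-\alpha}\) with \(\texttt{scipy.optimize.curve\_fit}\) under \(E < \min y\) and \(A,\alpha \ge 0\). It is not the report's \(\texttt{fit\_floor\_power}\): the function names, the starting point, the small margin used to keep \(E\) strictly below \(\min y\), and the assumed form of the 2-parameter log-log fit (\(L = A\,(C/10^{18})^{-\alpha}\), no floor) are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

C_REF = 1e18  # reference compute that normalizes C in the formula above


def floor_power(c, E, A, alpha):
    """L_inf(C) = E + A * (C / 1e18) ** (-alpha)."""
    return E + A * (np.asarray(c, dtype=float) / C_REF) ** (-alpha)


def fit_floor_power_sketch(compute, loss):
    """Bounded 3-parameter fit with E < min(loss) and A, alpha >= 0.

    Hypothetical stand-in for the report's fit_floor_power.
    """
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    lo, hi = loss.min(), loss.max()
    span = max(hi - lo, 1e-6)
    # Assumed starting point, chosen to sit strictly inside the bounds.
    p0 = [lo - 0.5 * span, 1.5 * span, 0.3]
    # Upper bound on E sits a hair below min(loss) to mimic the strict inequality.
    bounds = ([-np.inf, 0.0, 0.0], [lo - 1e-9, np.inf, np.inf])
    popt, _ = curve_fit(floor_power, compute, loss, p0=p0, bounds=bounds)
    return popt  # (E, A, alpha)


def fit_loglog_sketch(compute, loss):
    """Assumed 2-parameter form: a straight line in log-log space, no floor E."""
    x = np.log(np.asarray(compute, dtype=float) / C_REF)
    y = np.log(np.asarray(loss, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope  # (A, alpha)
```

Calling `fit_floor_power_sketch(compute, loss)` on the ladder points of one (mix, LR, val set) cell returns `(E, A, alpha)`; `floor_power(1e21, E, A, alpha)` then gives the extrapolated loss that can be compared against the held-out actual.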

Controls: fit type; mix / learning rate; val sets; fit cutoff (fit through 3e20).

Pick a (mix, learning rate) cell; the plot then shows one line per val set: the original anchor and the three decon cutoffs. The contamination signature is that the anchor's fit bends down harder at large compute (memorization credit), while the J≥0.50 set, which drops the most near-duplicates, gives the honest curve. Open circles mark training cells dropped by the current cutoff.
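To make the three decon cutoffs concrete, here is a minimal sketch of how nested val sets could be derived from a Jaccard threshold. It is an assumption-laden illustration, not the report's pipeline: word 5-gram shingles, brute-force comparison against every train doc, and the names `shingles`, `jaccard`, and `decon_val_sets` are all hypothetical, and the "verified" step mentioned above is omitted.

```python
def shingles(text, n=5):
    """Set of word n-grams; n=5 is an assumed choice, not the report's."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def decon_val_sets(val_docs, train_docs, cutoffs=(0.90, 0.75, 0.50)):
    """For each cutoff J, keep only val docs whose best train match is below J."""
    train_sh = [shingles(t) for t in train_docs]
    max_sim = []
    for doc in val_docs:
        sh = shingles(doc)
        # Highest similarity to any train doc (0.0 if there are no train docs).
        max_sim.append(max((jaccard(sh, t) for t in train_sh), default=0.0))
    return {c: [d for d, s in zip(val_docs, max_sim) if s < c] for c in cutoffs}
```

Because a doc dropped at J ≥ 0.90 is also dropped at J ≥ 0.75 and J ≥ 0.50, the sets are nested, and the 0.50 set is the smallest and strictest of the three.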

Held-Out Predictions

1e21

1e22

Per-Recipe Fit Quality