Curve Prediction Within A LR / Mix / Flop Cell
This first setting treats each completed run as its own cell: fixed flop scale, fixed data mix, and fixed learning rate. For a selected prefix \(p\), each parametric method fits only that cell's validation points with normalized progress \(\tau \le p\), then predicts the endpoint at \(\tau=1\). Each run's endpoint error is \(|\hat L(1)-L(1)|\); aggregate tables report MAE as the mean of those absolute endpoint errors, plus the worst-case absolute error.
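The endpoint metric described above can be sketched as follows. This is a minimal illustration, not the report's own code; the function names and the three example runs are hypothetical.

```python
import numpy as np

def endpoint_errors(pred_final, true_final):
    """Absolute endpoint error |L_hat(1) - L(1)| for each run."""
    pred = np.asarray(pred_final, dtype=float)
    true = np.asarray(true_final, dtype=float)
    return np.abs(pred - true)

def summarize(errors):
    """Return (MAE, worst-case absolute error) over a set of runs."""
    errors = np.asarray(errors, dtype=float)
    return float(errors.mean()), float(errors.max())

# Hypothetical predicted and observed final losses for three runs.
errs = endpoint_errors([1.52, 1.40, 1.31], [1.50, 1.43, 1.30])
mae, worst = summarize(errs)
```

The MAE here averages over runs, not over points within a run: each run contributes exactly one error, measured at \(\tau=1\).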
Shared endpoint model
\(F\) is the predicted final loss. \(A\) is the remaining drop, estimated from the prefix trajectory. Shape parameters \(\theta\) are initialized from a small grid and then optimized with SciPy.
Log
A sharp early drop with a long flattening tail.
Exponential
A fast decay that asymptotes toward the endpoint.
Power
A heavier-tailed curve; it can stay bendy deeper into training.
Rational
A smooth sideways-S / shoulder shape for trajectories that drop quickly then flatten.
MAE vs Huber Fit
MAE variants use bounded `scipy.optimize.minimize`; Huber variants use `scipy.optimize.least_squares(loss="huber")`. Shape parameters start from the small grid above, then SciPy optimizes \(F\), \(A\), and \(\theta\). All reported scores below are endpoint absolute error / MAE.
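A hedged sketch of the two fitting routes follows. The curve form is an assumption made for illustration only: a normalized exponential with \(g(0)=1\) and \(g(1)=0\), so that the curve equals \(F\) at \(\tau=1\) and \(A\) scales the drop above \(F\). The bounds, the grid of starting rates, and the choice of `L-BFGS-B` as the bounded method are likewise hypothetical and may differ from the report's actual settings.

```python
import numpy as np
from scipy.optimize import least_squares, minimize

def curve(params, tau):
    """Hypothetical exponential form: equals F at tau=1 and F + A at tau=0."""
    F, A, k = params
    g = (np.exp(-k * tau) - np.exp(-k)) / (1.0 - np.exp(-k))
    return F + A * g

# Assumed box bounds for (F, A, k) and an assumed small grid of starting rates.
BOUNDS = [(0.0, 10.0), (0.0, 10.0), (0.1, 20.0)]
K_GRID = (0.5, 2.0, 8.0)

def starts(tau, loss):
    """Yield one starting point per grid value of the shape parameter k."""
    for k0 in K_GRID:
        yield np.array([loss[-1], max(loss[0] - loss[-1], 1e-3), k0])

def fit_mae(tau, loss):
    """Minimize mean absolute residual with a bounded optimizer."""
    obj = lambda p: np.mean(np.abs(curve(p, tau) - loss))
    fits = [minimize(obj, x0, method="L-BFGS-B", bounds=BOUNDS)
            for x0 in starts(tau, loss)]
    return min(fits, key=lambda r: r.fun).x

def fit_huber(tau, loss, f_scale=0.01):
    """Least squares on residuals with SciPy's built-in Huber loss."""
    lo, hi = map(list, zip(*BOUNDS))
    resid = lambda p: curve(p, tau) - loss
    fits = [least_squares(resid, x0, bounds=(lo, hi), loss="huber", f_scale=f_scale)
            for x0 in starts(tau, loss)]
    return min(fits, key=lambda r: r.cost).x

# Synthetic, noiseless trajectory; only the prefix tau <= 0.5 is used for fitting.
tau_all = np.linspace(0.05, 1.0, 96)
loss_all = curve((1.5, 1.0, 3.0), tau_all)
mask = tau_all <= 0.5
tau, loss = tau_all[mask], loss_all[mask]

p_mae = fit_mae(tau, loss)
p_huber = fit_huber(tau, loss)
pred_mae = curve(p_mae, 1.0)      # predicted endpoint at tau = 1
pred_huber = curve(p_huber, 1.0)
```

The MAE objective is not smooth at zero residual, so a gradient-based bounded optimizer can stall on it. Running several starting points and keeping the best is one inexpensive guard against that.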
Per-Cell Within-Run Prediction
Lines are observed trajectories. Diamonds are observed finals; x markers are predicted finals. Parametric curve methods draw fitted continuations from the selected prefix to the final step.
Best Parametric Form By Scale And Prefix
Each cell shows the parametric curve variant with lowest final-loss MAE at that scale and prefix.
Target MAE Config Search
Pick a target for the maximum absolute final-loss error, then restrict the search to the mixture and learning-rate regime you care about. A config qualifies only if every completed run in the selected regime has an absolute endpoint error at or below the target.
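The qualification rule can be sketched as a filter on the worst-case error per config. The record schema (`config`, `mix`, `lr`, `abs_err`) and the example values are hypothetical, chosen only to show the logic.

```python
from collections import defaultdict

def qualifying_configs(records, target, mix=None, lr=None):
    """Return configs whose worst absolute endpoint error in the regime is <= target.

    records: iterable of dicts with keys "config", "mix", "lr", "abs_err"
    (an assumed schema). mix / lr of None means "do not filter on this axis".
    """
    worst = defaultdict(float)
    for r in records:
        if mix is not None and r["mix"] != mix:
            continue
        if lr is not None and r["lr"] != lr:
            continue
        worst[r["config"]] = max(worst[r["config"]], r["abs_err"])
    return sorted(c for c, w in worst.items() if w <= target)

# Hypothetical per-run errors for two fitting configs.
records = [
    {"config": "power/huber/p=0.3", "mix": "A", "lr": 1e-3, "abs_err": 0.010},
    {"config": "power/huber/p=0.3", "mix": "A", "lr": 3e-4, "abs_err": 0.030},
    {"config": "log/mae/p=0.3",     "mix": "A", "lr": 1e-3, "abs_err": 0.015},
    {"config": "log/mae/p=0.3",     "mix": "A", "lr": 3e-4, "abs_err": 0.018},
]
ok_all = qualifying_configs(records, target=0.02)
ok_lr = qualifying_configs(records, target=0.02, mix="A", lr=1e-3)
```

Because the rule uses the maximum, not the mean, a single bad run disqualifies a config. Narrowing the regime can therefore only add qualifying configs, never remove them.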
Selected Mix / LR
Per Scale In Selected Regime
Global reference across all mixes and learning rates
All Runs
All Runs Per Scale
Endpoint Scaling Law (Compute Vs Final Loss)
Chinchilla-style 3-parameter fit: \(L_\infty(C) = E + A\,(C/10^{18})^{-\alpha}\), where \(E\) is the irreducible-loss floor, \(C\) is base-model FLOPs, and \(L_\infty\) is final \(\texttt{math\_val\_loss}\). Fit per \((\textrm{mix}, \textrm{LR})\) on the small ladder (3e18 → 3e20) by \(\texttt{scipy.optimize.curve\_fit}\) with \(E < \min y\), \(A,\alpha \ge 0\). The 1e21 and 1e22 cells are never used by the fit; their actuals are plotted as triangles for the extrapolation check. The two-parameter log-log fit \(\log L = a + b\,\log C\) is available as a toggle for comparison; because it lacks the asymptote \(E\), it under-predicts loss at very large compute.
The slider sets the upper compute bound used for the fit. Drop it to 2e20 to see how the held-out predictions degrade when the largest small-ladder cell is unavailable; drop it further to see the fit collapse as the fitted compute range (the lever arm) shrinks. Open circles mark training cells dropped by the current cutoff.
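A minimal sketch of the floor-plus-power-law fit and the log-log alternative, under stated assumptions. The intermediate ladder points, the synthetic losses, the starting guess, and the small margin `eps` used to enforce \(E < \min y\) are all hypothetical; only the functional form, the constraints, and the use of `curve_fit` come from the description above.

```python
import numpy as np
from scipy.optimize import curve_fit

C_REF = 1e18  # reference compute used to normalize C

def floor_power_law(C, E, A, alpha):
    """L(C) = E + A * (C / 1e18) ** (-alpha)."""
    return E + A * (C / C_REF) ** (-alpha)

def fit_endpoint_law(C, y, c_max=np.inf):
    """Fit (E, A, alpha) on cells with C <= c_max, with E < min(y) and A, alpha >= 0."""
    C, y = np.asarray(C, float), np.asarray(y, float)
    keep = C <= c_max
    C, y = C[keep], y[keep]
    eps = 1e-6  # assumed margin keeping E strictly below the smallest observed loss
    p0 = [y.min() - 0.1, y.max() - y.min() + 0.1, 0.5]
    bounds = ([-np.inf, 0.0, 0.0], [y.min() - eps, np.inf, np.inf])
    popt, _ = curve_fit(floor_power_law, C, y, p0=p0, bounds=bounds)
    return popt

def fit_loglog(C, y, c_max=np.inf):
    """Two-parameter fit log L = a + b log C; returns (a, b)."""
    C, y = np.asarray(C, float), np.asarray(y, float)
    keep = C <= c_max
    b, a = np.polyfit(np.log(C[keep]), np.log(y[keep]), 1)
    return a, b

# Hypothetical small ladder with synthetic, noiseless final losses.
C_small = np.array([3e18, 1e19, 3e19, 1e20, 3e20])
y_small = floor_power_law(C_small, 1.2, 0.8, 0.3)

popt = fit_endpoint_law(C_small, y_small)
a, b = fit_loglog(C_small, y_small)

C_held = 1e22
pred_floor = floor_power_law(C_held, *popt)
pred_loglog = float(np.exp(a + b * np.log(C_held)))
```

Passing `c_max=2e20` to either function mimics moving the slider below the largest small-ladder cell. On data with a genuine floor, the log-log line keeps falling at large compute while the three-parameter curve levels off at \(E\).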
Held-Out Predictions (1e21, 1e22)
Per-Recipe Fit Quality
Joint Trajectory Fits Across LR, Mix, And Flop
This second setting follows the scaling-law-discovery idea more directly: fit shared trajectory regressions using \(\tau\), flop scale, data mix, and learning rate as features. The global scope fits across all flops, mixes, and learning rates. The by-flop scope fits within each flop scale, sharing only across mix and learning rate. For each prefix, fitting uses only points with \(\tau \le p\), then predicts endpoints at \(\tau=1\).
Source
The joint forms are inspired by "Can Language Models Discover Scaling Laws?" and its SLDBench project page. The key idea we are borrowing is shared structure over features, not an exact formula from the paper.
Joint Forms
\(z\) contains mix, LR, and optionally log-flops with interactions. These are global regressions, not one curve per cell.
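One possible way to assemble such a feature vector is sketched below. Everything specific here is an assumption: the mix is treated as a single scalar fraction, LR and flops enter on a log10 scale, and only pairwise interactions are included. The report's actual encoding may differ.

```python
import numpy as np

def build_z(mix_frac, lr, flops=None):
    """Design matrix with one row per run (hypothetical feature encoding).

    Columns: intercept, mix, log10(lr), mix*log10(lr), and, if flops is given,
    log10(flops) plus its interactions with mix and log10(lr).
    """
    mix = np.asarray(mix_frac, dtype=float)
    log_lr = np.log10(np.asarray(lr, dtype=float))
    cols = [np.ones_like(mix), mix, log_lr, mix * log_lr]
    if flops is not None:
        log_f = np.log10(np.asarray(flops, dtype=float))
        cols += [log_f, mix * log_f, log_lr * log_f]
    return np.column_stack(cols)

# By-flop scope: flop scale is fixed, so log-flops is omitted.
z_by_flop = build_z([0.1, 0.5], [1e-3, 3e-4])
# Global scope: log-flops and its interactions are included.
z_global = build_z([0.1, 0.5], [1e-3, 3e-4], flops=[3e18, 3e20])
```

Since one coefficient vector over these columns is shared by every run in scope, a cell with few prefix points can borrow strength from its neighbors, which a per-cell curve fit cannot do.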