Delphi Midtraining Interactive Scaling

Interactive validation-trajectory prediction for math validation loss. Parametric prefix fits use SciPy optimization with MAE or Huber objectives; final model comparison uses endpoint MAE.


Curve Prediction Within A LR / Mix / Flop Cell

This first setting treats each completed run as its own cell: fixed flop scale, fixed data mix, and fixed learning rate. For a selected prefix \(p\), each parametric method fits only that cell's validation points with normalized progress \(\tau \le p\), then predicts the endpoint at \(\tau=1\). Each run's endpoint error is \(|\hat L(1)-L(1)|\); aggregate tables report MAE as the mean of those absolute endpoint errors, plus the worst-case absolute error.
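The aggregation described above can be sketched as follows. This is a minimal illustration rather than the page's own code; the function name `endpoint_summary` and the sample numbers are hypothetical.

```python
import numpy as np


def endpoint_summary(predicted_final, observed_final):
    """Aggregate per-run endpoint errors |L_hat(1) - L(1)|.

    Both arguments hold one value per completed run (hypothetical layout).
    """
    err = np.abs(np.asarray(predicted_final, float) - np.asarray(observed_final, float))
    return {
        "mae": float(err.mean()),         # mean absolute endpoint error
        "max_abs_err": float(err.max()),  # worst-case run
    }


# Made-up endpoint losses for three runs, for illustration only.
summary = endpoint_summary([1.0, 1.2, 0.9], [1.1, 1.2, 1.2])
print(summary)
```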

Shared endpoint model

\(L(\tau) = F + A\,g(\tau; \theta),\quad g(1;\theta)=0\)

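\(F\) is the predicted final loss. \(A\) scales the remaining drop: at progress \(\tau\) the loss still has \(A\,g(\tau;\theta)\) left to fall, and because every shape below also satisfies \(g(0;\theta)=1\), \(A\) equals the total drop \(L(0)-L(1)\). Shape parameters \(\theta\) are initialized from a small grid and then optimized with SciPy.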

Log

\(g(\tau;s)=\frac{\log((1+s)/(\tau+s))}{\log((1+s)/s)}\)

A sharp early drop with a long flattening tail.

Exponential

\(g(\tau;r)=\frac{e^{-r\tau}-e^{-r}}{1-e^{-r}}\)

A fast decay that asymptotes toward the endpoint.

Power

\(g(\tau;s,a)=\frac{(\tau+s)^{-a}-(1+s)^{-a}}{s^{-a}-(1+s)^{-a}}\)

A heavier-tailed curve; it can stay bendy deeper into training.

Rational

\(g(\tau;t_0,\beta)=\frac{(1+(\tau/t_0)^\beta)^{-1}-(1+(1/t_0)^\beta)^{-1}}{1-(1+(1/t_0)^\beta)^{-1}}\)

A smooth sideways-S / shoulder shape for trajectories that drop quickly then flatten.
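The four shape families can be written directly from the formulas above. The sketch below is illustrative: the function names, the `SHAPES` table, and the parameter values in it are assumptions chosen only to check the normalization \(g(0;\theta)=1\), \(g(1;\theta)=0\), not values used by the page.

```python
import numpy as np


def g_log(tau, s):
    return np.log((1 + s) / (tau + s)) / np.log((1 + s) / s)


def g_exp(tau, r):
    return (np.exp(-r * tau) - np.exp(-r)) / (1 - np.exp(-r))


def g_power(tau, s, a):
    return ((tau + s) ** -a - (1 + s) ** -a) / (s ** -a - (1 + s) ** -a)


def g_rational(tau, t0, beta):
    def h(t):
        return 1.0 / (1.0 + (t / t0) ** beta)
    return (h(tau) - h(1.0)) / (1.0 - h(1.0))


# Hypothetical parameter values, used only to exercise the functions.
SHAPES = {
    "log": (g_log, (0.05,)),
    "exponential": (g_exp, (4.0,)),
    "power": (g_power, (0.1, 0.5)),
    "rational": (g_rational, (0.2, 1.5)),
}

for name, (fn, params) in SHAPES.items():
    # Each shape starts at 1 and reaches 0 at the endpoint.
    print(name, float(fn(0.0, *params)), float(fn(1.0, *params)))
```

With this normalization, \(F + A\,g(\tau;\theta)\) equals \(F + A\) at \(\tau=0\) and \(F\) at \(\tau=1\) for every family, which is what lets the four forms share one endpoint model.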

MAE vs Huber Fit

\(\min_{F,A,\theta}\sum_{\tau_i \le p}\rho(L_i - F - A\,g(\tau_i;\theta))\)

Here \(\rho\) is the absolute value for MAE variants and the Huber loss for Huber variants. MAE variants use bounded `scipy.optimize.minimize`; Huber variants use `scipy.optimize.least_squares(loss="huber")`. Shape parameters start from the small grid described above, then SciPy optimizes \(F\), \(A\), and \(\theta\). All scores reported below are endpoint absolute errors, aggregated as MAE.
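A minimal sketch of the two objectives for the exponential shape is given below. It is not the page's implementation: the grid values, bounds, `f_scale`, the choice of `L-BFGS-B` as the bounded method, and the synthetic trajectory are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares, minimize


def g_exp(tau, r):
    """Exponential shape; g(1; r) = 0."""
    return (np.exp(-r * tau) - np.exp(-r)) / (1.0 - np.exp(-r))


def fit_prefix(tau, loss, p, objective="huber", r_grid=(1.0, 3.0, 10.0), f_scale=0.05):
    """Fit F, A, r on points with tau <= p; return the best [F, A, r]."""
    t, y = tau[tau <= p], loss[tau <= p]

    def residuals(x):
        F, A, r = x
        return F + A * g_exp(t, r) - y

    best_cost, best_x = np.inf, None
    for r0 in r_grid:  # small grid of starting shapes (assumed values)
        x0 = np.array([y[-1], max(y[0] - y[-1], 1e-3), r0])
        if objective == "huber":
            res = least_squares(
                residuals, x0, loss="huber", f_scale=f_scale,
                bounds=([-np.inf, 0.0, 1e-3], [np.inf, np.inf, 50.0]),
            )
            cost = res.cost
        else:  # MAE objective with a bounded general-purpose minimizer
            res = minimize(
                lambda x: np.mean(np.abs(residuals(x))), x0,
                method="L-BFGS-B",
                bounds=[(None, None), (0.0, None), (1e-3, 50.0)],
            )
            cost = res.fun
        if cost < best_cost:
            best_cost, best_x = cost, res.x
    return best_x


# Synthetic noise-free trajectory with true final loss F = 1.0.
tau = np.linspace(0.0, 1.0, 41)
loss = 1.0 + 0.5 * g_exp(tau, 4.0)
F_huber = fit_prefix(tau, loss, p=0.5, objective="huber")[0]
F_mae = fit_prefix(tau, loss, p=0.5, objective="mae")[0]
print(abs(F_huber - 1.0), abs(F_mae - 1.0))  # endpoint absolute errors
```

The MAE objective is not smooth at zero residual, so a gradient-based minimizer may stop short of the optimum; restarting from several grid points, as above, is one way to reduce that risk.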

Per-Cell Within-Run Prediction

Lines are observed trajectories. Diamonds are observed finals; x markers are predicted finals. Parametric curve methods draw fitted continuations from the selected prefix to the final step.

Best Parametric Form By Scale And Prefix

Each cell shows the parametric curve variant with lowest final-loss MAE at that scale and prefix.

Target MAE Config Search

Pick a max absolute final-loss error target, then restrict the search to the mixture and learning-rate regime you care about. A config qualifies only if every completed run in that selected regime is at or below the target.
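The qualification rule above amounts to a filter on the worst-case error within the selected regime. The sketch below assumes a hypothetical record layout (one dict per run and config, with keys `config`, `mix`, `lr`, `abs_err`); the sample records are made up.

```python
from collections import defaultdict


def qualifying_configs(records, target, mix=None, lr=None):
    """Return (config, worst_abs_err, mae) for configs whose every run meets the target.

    `mix` / `lr` restrict the regime; None means no restriction.
    """
    errs = defaultdict(list)
    for rec in records:
        if mix is not None and rec["mix"] != mix:
            continue
        if lr is not None and rec["lr"] != lr:
            continue
        errs[rec["config"]].append(rec["abs_err"])
    rows = [
        (cfg, max(e), sum(e) / len(e))
        for cfg, e in errs.items()
        if max(e) <= target  # every run at or below the target
    ]
    return sorted(rows, key=lambda row: row[2])  # lowest MAE first


# Made-up records for illustration only.
records = [
    {"config": "log/huber", "mix": "m1", "lr": 1e-4, "abs_err": 0.010},
    {"config": "log/huber", "mix": "m1", "lr": 1e-4, "abs_err": 0.030},
    {"config": "exp/mae", "mix": "m1", "lr": 1e-4, "abs_err": 0.015},
    {"config": "exp/mae", "mix": "m1", "lr": 1e-4, "abs_err": 0.018},
    {"config": "exp/mae", "mix": "m2", "lr": 1e-4, "abs_err": 0.090},
]
hits = qualifying_configs(records, target=0.02, mix="m1", lr=1e-4)
print(hits)
```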

Selected Mix / LR

Per Scale In Selected Regime

Global reference across all mixes and learning rates

All Runs

All Runs Per Scale

Endpoint Scaling Law (Compute Vs Final Loss)

Chinchilla-style 3-parameter fit: \(L_\infty(C) = E + A\,(C/10^{18})^{-\alpha}\), where \(E\) is the irreducible-loss floor, \(C\) is base-model FLOPs, and \(L_\infty\) is final \(\texttt{math\_val\_loss}\). The fit is run per \((\textrm{mix}, \textrm{LR})\) on the small ladder (3e18 → 3e20) with \(\texttt{scipy.optimize.curve\_fit}\), constrained to \(E < \min y\) and \(A,\alpha \ge 0\). The 1e21 and 1e22 cells are never used by the fit; their actuals are plotted as triangles for the extrapolation check. The two-parameter log-log fit \(\log L = a + b\,\log C\) is available as a toggle for comparison; because it has no asymptote, it under-predicts loss at very large compute.
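A sketch of both fits on synthetic data follows. The intermediate ladder points, the initial guess, the small margin used to approximate the strict bound \(E < \min y\), and the generating parameters are assumptions for illustration; only the functional forms and the constraints come from the text.

```python
import numpy as np
from scipy.optimize import curve_fit

C_REF = 1e18  # reference compute in the formula above


def floor_power_law(C, E, A, alpha):
    return E + A * (C / C_REF) ** (-alpha)


def fit_floor_power_law(C, y):
    C, y = np.asarray(C, float), np.asarray(y, float)
    spread = max(y.max() - y.min(), 1e-3)
    p0 = [y.min() - spread, 2.0 * spread, 0.3]  # assumed starting point
    # E strictly below min(y) is approximated with a tiny margin; A, alpha >= 0.
    bounds = ([-np.inf, 0.0, 0.0], [y.min() - 1e-9, np.inf, np.inf])
    popt, _ = curve_fit(floor_power_law, C, y, p0=p0, bounds=bounds)
    return popt


def fit_loglog(C, y):
    """Two-parameter comparison fit: log L = a + b log C."""
    b, a = np.polyfit(np.log(C), np.log(y), 1)
    return a, b


# Synthetic ladder between 3e18 and 3e20 (intermediate points are hypothetical).
C_train = np.array([3e18, 1e19, 3e19, 1e20, 3e20])
y_train = floor_power_law(C_train, 1.0, 0.8, 0.3)
C_heldout = np.array([1e21, 1e22])
y_heldout = floor_power_law(C_heldout, 1.0, 0.8, 0.3)

E, A, alpha = fit_floor_power_law(C_train, y_train)
pred_floor = floor_power_law(C_heldout, E, A, alpha)
a, b = fit_loglog(C_train, y_train)
pred_loglog = np.exp(a + b * np.log(C_heldout))
print(np.abs(pred_floor - y_heldout), np.abs(pred_loglog - y_heldout))
```

On data generated with a positive floor, \(\log L\) is convex in \(\log C\), so a straight line fitted inside the ladder falls below the true curve once extrapolated past it. That is the under-prediction the text describes.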


The slider sets the upper compute bound used for fitting. Drop it to 2e20 to see how the held-out predictions degrade when the largest small-ladder cell is unavailable; drop it further to see the fit collapse as the compute range it spans shrinks. Open circles mark training cells dropped by the current cutoff.

Held-Out Predictions (1e21, 1e22)

Per-Recipe Fit Quality

Joint Trajectory Fits Across LR, Mix, And Flop

This second setting follows the scaling-law-discovery idea more directly: fit shared trajectory regressions using \(\tau\), flop scale, data mix, and learning rate as features. The global scope fits across all flops, mixes, and learning rates. The by-flop scope fits within each flop scale, sharing only across mix and learning rate. For each prefix, fitting uses only points with \(\tau \le p\), then predicts endpoints at \(\tau=1\).

Source

The joint forms are inspired by "Can Language Models Discover Scaling Laws?" and its SLDBench project page. The key idea we are borrowing is shared structure over features, not an exact formula from the paper.

Joint Forms

\(z(\mathrm{flop},m,\eta)^\top\beta_0 + z^\top\beta_1 e^{-k\tau} + z^\top\beta_2\tau\)
\(z^\top\beta_0 + z^\top\beta_1(\tau+s)^{-\alpha} + z^\top\beta_2\tau\)
\(z^\top\beta_0 + z^\top\beta_1\exp(-b e^{-k\tau}) + z^\top\beta_2\tau\)

\(z\) contains mix, LR, and optionally log-flops with interactions. These are global regressions, not one curve per cell.
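To make the "global regression" idea concrete, here is one possible way to fit the first (exponential) joint form. It is a sketch under stated assumptions, not the method used on this page: for a fixed rate \(k\) the model is linear in \(\beta_0,\beta_1,\beta_2\), so the code solves ordinary least squares for each \(k\) on a small grid and keeps the best. The feature vector, the grid, and the synthetic runs are all hypothetical.

```python
import numpy as np


def design(z, tau, k):
    """Columns: z (beta_0), z * exp(-k tau) (beta_1), z * tau (beta_2)."""
    basis = np.stack([np.ones_like(tau), np.exp(-k * tau), tau], axis=1)  # (n, 3)
    return (basis[:, :, None] * z[:, None, :]).reshape(len(tau), -1)      # (n, 3d)


def fit_joint_exp(z, tau, y, p, k_grid=(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)):
    """Fit all runs at once using only points with tau <= p."""
    mask = tau <= p
    best = None
    for k in k_grid:
        X = design(z[mask], tau[mask], k)
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        sse = float(np.sum((X @ beta - y[mask]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, k, beta)
    return best[1], best[2]


def predict_endpoint(z_run, k, beta):
    """Predicted loss at tau = 1 for one run's feature vector."""
    return float(design(z_run[None, :], np.array([1.0]), k)[0] @ beta)


# Synthetic runs: z = [1, mix, log10(lr)] (a hypothetical feature choice).
tau_run = np.linspace(0.05, 1.0, 20)
z_rows, tau_rows = [], []
for mix in (0.3, 0.7):
    for lr in (1e-4, 3e-4, 1e-3):
        z_rows.append(np.tile([1.0, mix, np.log10(lr)], (len(tau_run), 1)))
        tau_rows.append(tau_run)
z, tau = np.vstack(z_rows), np.concatenate(tau_rows)
beta_true = np.array([1.2, -0.3, 0.02, 0.6, 0.1, -0.05, -0.05, 0.0, 0.01])
y = design(z, tau, 4.0) @ beta_true

k_hat, beta_hat = fit_joint_exp(z, tau, y, p=0.6)
z_new = np.array([1.0, 0.7, np.log10(3e-4)])
pred = predict_endpoint(z_new, k_hat, beta_hat)
truth = predict_endpoint(z_new, 4.0, beta_true)
print(k_hat, abs(pred - truth))
```

Because the coefficients are shared across runs, a single fit produces endpoint predictions for every (mix, LR) combination, which is the contrast with the per-cell curves in the first setting.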

Joint Configs Meeting Target

Joint Per-Scale Configs