2026-06-21

IDK-1 week 1: what 8,000 steps of training actually looks like

week 1. step 8,150 out of 100,000. here's what happened.

the model started at val loss 10.74. that's basically random — it knows nothing. by step 500 it was already at 7.81. the first few hundred steps are always the dramatic part. the model goes from "what is language" to "okay i've seen some patterns" very fast.
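
a quick sanity check on "basically random". this is a sketch that assumes the reported loss is mean cross-entropy in nats (the common default, but not stated above). under that assumption, a model guessing uniformly over a vocabulary of V tokens scores ln(V), so the starting loss can be inverted into the vocabulary size it would imply. the function names are mine, for illustration only.

```python
import math

def uniform_loss(vocab_size: int) -> float:
    """cross-entropy in nats of guessing uniformly over vocab_size tokens."""
    return math.log(vocab_size)

def implied_vocab_size(loss_nats: float) -> float:
    """vocabulary size at which uniform guessing would score loss_nats."""
    return math.exp(loss_nats)

start_loss = 10.74  # val loss at step 0, from the post
print(f"implied vocab size: {implied_vocab_size(start_loss):,.0f}")
```

if the starting loss really is the uniform baseline, that points to a vocabulary of roughly 46,000 tokens. that's an inference from the loss, not a number from the training config.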

then it plateaued. and stayed there for a while.

from step 500 to step 8,000, val loss has been oscillating between 7.79 and 7.82. if you just look at the numbers without context, it looks like the model isn't learning. it is, just slowly. the reason is simple: we're only 8% of the way through training, 8,000 steps out of 100,000. the model has seen roughly 200M tokens out of 2.64B. that's not enough to push loss down aggressively yet.
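
the token arithmetic, as a small sketch. it assumes the 100,000 steps cover the 2.64B tokens exactly once and that every step consumes the same number of tokens. neither is stated outright above, so treat the per-step figure as derived, not configured.

```python
TOTAL_STEPS = 100_000
TOTAL_TOKENS = 2.64e9  # 2.64B tokens, from the post

def tokens_per_step() -> float:
    """tokens consumed per optimizer step, assuming a constant batch size."""
    return TOTAL_TOKENS / TOTAL_STEPS

def tokens_seen(step: int) -> float:
    """tokens processed after `step` steps."""
    return step * tokens_per_step()

def fraction_done(step: int) -> float:
    """share of the planned run completed at `step`."""
    return step / TOTAL_STEPS

step = 8_000
print(f"{tokens_seen(step) / 1e6:.0f}M tokens, {fraction_done(step):.0%} of the run")
```

under those assumptions each step is about 26,400 tokens, and step 8,000 lands a little over 200M tokens, which matches the rough figure above.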

the benchmark i set for week 1: val loss below 7.80 by step 5,000. it hit 7.7955 at step 4,500 and 7.7966 at step 5,000. benchmark cleared.

what "val loss 7.80" actually means: the model can predict the next token with some consistency, but it's still mostly lost. at this stage it's learning basic indonesian word co-occurrence, not semantics. it knows "saya" often comes before "tidak", not why.
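
one way to make a loss number more tangible is perplexity, exp(loss): the number of equally likely tokens the model is effectively choosing between at each position. a minimal sketch, again assuming the loss is cross-entropy in nats:

```python
import math

def perplexity(loss_nats: float) -> float:
    """effective number of equally likely next-token choices."""
    return math.exp(loss_nats)

# val losses quoted in the post
checkpoints = [("step 0", 10.74), ("step 500", 7.81), ("step 5,000", 7.7966)]
for label, loss in checkpoints:
    print(f"{label}: loss {loss} -> perplexity ~{perplexity(loss):,.0f}")
```

at 7.80 that works out to roughly 2,400 effective choices per token. far better than the start, still a long way from fluent.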

the real drop should come around step 20k-30k, when we hit 20-30% of the data. that's an expectation, not a guarantee: it's roughly where i'd expect something resembling language understanding to start showing up. we're not there yet.

hardware: 2x T4 on Kaggle free tier. ~15,300 tokens/second. checkpoints saved at steps 2,500, 5,000, and 7,500. if the session dies, we resume from the last checkpoint. no drama.
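
the "resume from the last checkpoint" step could look something like this. it's a sketch, not the actual training script: the `ckpt_step<N>.pt` naming scheme and both function names are hypothetical, and loading the weights and optimizer state is left out.

```python
import re
from pathlib import Path
from typing import Optional

# hypothetical naming scheme: ckpt_step2500.pt, ckpt_step5000.pt, ...
CKPT_PATTERN = re.compile(r"ckpt_step(\d+)\.pt$")

def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """return the checkpoint with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in ckpt_dir.glob("ckpt_step*.pt"):
        match = CKPT_PATTERN.search(path.name)
        # compare steps as integers, not as strings
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

def resume_step(ckpt_dir: Path) -> int:
    """step to restart from: the latest saved step, or 0 on a fresh run."""
    path = latest_checkpoint(ckpt_dir)
    if path is None:
        return 0
    return int(CKPT_PATTERN.search(path.name).group(1))
```

the integer comparison matters once the run passes step 10,000: sorted as plain strings, `ckpt_step7500.pt` would come after `ckpt_step10000.pt`.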

ETA to 100k steps: ~54 hours of compute, split across two Kaggle accounts. if the free quota runs out, probably Vast.ai.
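
a back-of-envelope version of the ETA, using only the throughput and token counts quoted in this post. it assumes a constant ~15,300 tokens/second and the same tokens per step throughout (2.64B tokens over 100,000 steps), and it ignores everything that isn't a training step.

```python
TOTAL_STEPS = 100_000
TOTAL_TOKENS = 2.64e9          # from the post
TOKENS_PER_SECOND = 15_300     # from the post, assumed constant

def eta_hours(current_step: int) -> float:
    """hours of pure training compute left after current_step."""
    tokens_per_step = TOTAL_TOKENS / TOTAL_STEPS
    remaining_tokens = (TOTAL_STEPS - current_step) * tokens_per_step
    return remaining_tokens / TOKENS_PER_SECOND / 3600

print(f"~{eta_hours(8_150):.0f} hours of pure training compute left")
```

pure throughput comes out closer to 44 hours. the ~54 hour figure above presumably leaves room for eval passes, checkpoint writes, and session restarts, though that's my reading of the gap, not something these numbers prove.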

that's week 1. nothing broke. the loss is going the right direction. slow is fine.
