2026-06-22

IDK-1 at 47,500 steps: what the output actually looks like

step 47,500 out of 100,000. almost halfway. here's the honest update.

val loss is 7.79–7.81. it has been 7.79–7.81 since step 20,000. if you're looking at that number and thinking 'the model isn't learning anymore' — that's a fair read. it's also wrong.

what's actually happening: the learning rate is still relatively high (~1.8e-04 at step 47K on a cosine schedule from 3e-04). at high LR, the optimizer is taking large steps and the loss oscillates instead of descending cleanly. the real drop happens in the back half, when LR gets small enough that the model can settle into a minimum instead of bouncing around one. best val loss so far is 7.7908, hit at step 29K. everything since then has been within 0.02 of that. not stuck — just in a holding pattern until the schedule kicks in.
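
for reference, here's a minimal sketch of the schedule arithmetic. the peak LR (3e-04) and the 100,000 total steps are from this run. the floor at 10% of peak and the absence of warmup are assumptions for illustration, not confirmed settings, and `cosine_lr` is a hypothetical helper name.

```python
import math

PEAK_LR = 3e-4          # from the post
TOTAL_STEPS = 100_000   # from the post
MIN_LR = 3e-5           # ASSUMPTION: floor at 10% of peak, not a confirmed setting


def cosine_lr(step, peak_lr=PEAK_LR, min_lr=MIN_LR, total_steps=TOTAL_STEPS):
    """Cosine decay from peak_lr down to min_lr. No warmup (assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 20_000, 47_000, 70_000, 100_000):
        print(f"step {step:>7,}: lr = {cosine_lr(step):.3e}")
```

with that assumed floor, step 47K comes out near 1.78e-04, in line with the ~1.8e-04 quoted above. with a floor of zero it would be about 1.64e-04 instead.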

at step 47K, the model has seen roughly 1.55 billion tokens, out of 2.64 billion in the dataset. that's a bit under 60% of a single pass: it's seen most of the data once, and none of it twice. transformers generally need multiple passes over data before the patterns really stick. we're not there yet.
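
the token count is just steps times tokens per step. a quick sketch of that arithmetic, using the ~32,768 tokens per step from this run's setup and the 2.64B dataset size. the helper names are hypothetical, and it assumes the batch size stays fixed for the whole run.

```python
TOKENS_PER_STEP = 32_768        # approximate figure for this run
DATASET_TOKENS = 2_640_000_000  # 2.64B tokens in the dataset
TOTAL_STEPS = 100_000


def tokens_seen(step, tokens_per_step=TOKENS_PER_STEP):
    """Total tokens processed after `step` optimizer steps."""
    return step * tokens_per_step


def epochs_seen(step, dataset_tokens=DATASET_TOKENS):
    """Fraction of the dataset covered, measured in passes (epochs)."""
    return tokens_seen(step) / dataset_tokens


if __name__ == "__main__":
    for step in (47_500, TOTAL_STEPS):
        print(f"step {step:,}: {tokens_seen(step) / 1e9:.2f}B tokens, "
              f"{epochs_seen(step):.2f} epochs")
```

at those numbers, step 47,500 works out to about 1.56B tokens (0.59 of a pass), and step 100,000 to about 3.28B tokens, or roughly 1.24 passes.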

i ran inference for the first time today. loaded the step 37,500 checkpoint on a kaggle T4, gave it five prompts. here's what came out for 'Indonesia adalah negara' ("Indonesia is a country"):

'Indonesia adalah negara yang. dan, pada. tidak yang dari tidak di untuk bisa karena Anda di, lebih ini.'

that's not coherent. it's also not random. every word there is a real indonesian word. 'yang', 'dan', 'di', 'tidak', 'bisa', 'karena': these are all valid. the model knows it's generating indonesian. it just doesn't know how to connect the words yet. it's learned co-occurrence, not grammar. it knows 'tidak' ("not") often appears near 'bisa' ("can"). it doesn't know why or when.

for comparison: a completely untrained model produces garbage characters and impossible token sequences. IDK-1 at 37.5K is generating valid vocabulary in a broken sentence structure. that's the correct place to be at this stage.

the inflection point i'm watching for is around step 60–70K. that's typically where models at this data scale start producing multi-word phrases that actually parse. not coherent paragraphs, just phrases. 'perkembangan ekonomi indonesia' ("indonesian economic development") instead of 'perkembangan. dan, ini yang.'

hardware is the same: two T4s on kaggle free tier, ~32,768 tokens per step. the second kaggle account (girlfriend's, thanks) is carrying this session. when it runs out, wait for quota reset, load from checkpoint, continue. no drama. that's been the whole strategy.
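
the save-and-resume loop can be sketched roughly like this. it's framework-free on purpose: the state here is only a step counter in a JSON file, where a real checkpoint would also carry model weights, optimizer state, and the LR schedule position. all function names, the file format, and the `session_budget` stand-in for kaggle quota are hypothetical, not the actual training script.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a killed session never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def run_session(path, total_steps, session_budget, save_every):
    """Run until the session budget (a stand-in for quota) is used up."""
    state = load_checkpoint(path)
    steps_done = 0
    while state["step"] < total_steps and steps_done < session_budget:
        state["step"] += 1  # placeholder for one real optimizer step
        steps_done += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    # steps taken after the last save are lost when the session dies
    return state["step"]
```

the one design choice worth noting: writing to a temp file and renaming means a session that gets cut off mid-save still leaves the previous checkpoint intact. the cost of a cutoff is then at most `save_every` steps of repeated work.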

52,500 steps left. slow is fine.
