i cancelled 53,000 steps of training because of one line of code
i was wrong in the last post.
i said the plateau at val loss 7.8 was because the learning rate was high — that the optimizer was bouncing around a minimum instead of settling into one. i said the real drop would come in the back half of training. i said 'slow is fine.'
none of that was true. i cancelled the run at step 53,850 this morning.
here's what actually happened.
the data pipeline has a merge step. after tokenizing Wikipedia ID and CulturaX ID into two separate binary files, you merge them into one train.bin. the original code did this by reading the data in chunks of 1 million tokens and shuffling each chunk before writing it out. the intention was to mix the two sources. the implementation was wrong.
np.random.shuffle(chunk), called on a flat array of token IDs, shuffles individual tokens, not documents. so instead of mixing Wikipedia chunks with CulturaX chunks, it wrote out every 1-million-token chunk as a random permutation of itself. the training data was not indonesian text. it was the right tokens in the wrong order, chunk after chunk.
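to make the failure concrete, here is a minimal sketch of what that call does to a 1-D array. the two tiny "documents" and their token IDs are made up for illustration; the only real thing here is np.random.shuffle itself, which permutes the elements of a 1-D array in place.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the demo is repeatable

# stand-in for one chunk: two made-up "documents" laid end to end
doc_a = np.arange(100, 110, dtype=np.uint16)
doc_b = np.arange(200, 210, dtype=np.uint16)
chunk = np.concatenate([doc_a, doc_b])

shuffled = chunk.copy()
np.random.shuffle(shuffled)  # in place: permutes individual token IDs

# the multiset of tokens is unchanged, so token frequencies survive...
same_tokens = np.array_equal(np.sort(shuffled), np.sort(chunk))
# ...but the sequence itself is gone
order_preserved = np.array_equal(shuffled, chunk)

print(chunk)
print(shuffled)
print("same tokens:", same_tokens, "| order preserved:", order_preserved)
```

the frequencies are identical before and after, which is why the only thing left to learn is which tokens are common.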
the model never saw a coherent sentence. not once. every 512-token training window was a random permutation of token IDs with no sequential relationship. you cannot learn grammar from that. you cannot learn syntax. you can only learn which tokens appear frequently — unigram statistics.
val loss 7.8 makes sense now. the baseline for guessing uniformly over a vocab of 40,000 is ln(40000) ≈ 10.6. a model that only knows token frequencies bottoms out at roughly the unigram entropy of the corpus, and somewhere around 7.5–8.0 is a plausible value for that. that's exactly where IDK-1 was. not learning. done learning.
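the two numbers in that argument can be checked directly. this is a small sketch, not code from the pipeline: uniform_loss and unigram_entropy are names i'm making up here, and the unigram figure for the real corpus is not computed, only how you would compute it from a list of token IDs.

```python
import math
from collections import Counter

def uniform_loss(vocab_size):
    """cross-entropy in nats of guessing uniformly over the vocab."""
    return math.log(vocab_size)

def unigram_entropy(token_ids):
    """entropy in nats of the empirical token frequency distribution.

    roughly the loss floor for a model when token order carries
    no information, as with shuffled data.
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(round(uniform_loss(40_000), 2))  # 10.6
```

running unigram_entropy over a sample of the token file would tell you whether 7.8 really is the frequency-only floor, instead of taking the range on faith.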
the core of the fix is one line: remove the shuffle. in its place, the merge now reads chunks from each source file in round-robin (one chunk from Wikipedia, one from CulturaX, alternating), so the two sources are mixed without touching token order inside a chunk. sequences stay coherent. the model can actually learn language.
i've updated the data pipeline notebook. the corrected merge function is 40 lines and does what the original was supposed to do.
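for anyone who wants the shape of it without opening the notebook, here is a rough sketch of a round-robin merge. this is not the notebook's function. it assumes tokens are stored as raw fixed-width integers (2 bytes each, which is enough for a 40,000 vocab but is my assumption about the file format), and merge_round_robin and its parameters are hypothetical names.

```python
def merge_round_robin(source_paths, out_path,
                      chunk_tokens=1_000_000, bytes_per_token=2):
    """alternate fixed-size chunks from each source file into out_path.

    bytes are copied in the order they are read, so token order
    inside every chunk is left exactly as it was in the source.
    """
    chunk_bytes = chunk_tokens * bytes_per_token
    sources = [open(path, "rb") for path in source_paths]
    try:
        with open(out_path, "wb") as out:
            active = list(sources)
            while active:
                still_active = []
                for src in active:
                    data = src.read(chunk_bytes)
                    if data:  # empty read means this source is exhausted
                        out.write(data)
                        still_active.append(src)
                active = still_active
    finally:
        for src in sources:
            src.close()
```

when one source runs out, the rest of the other is written in order. one caveat of any fixed-size chunking: a document that straddles a chunk boundary gets interrupted by a chunk from the other source, so a small fraction of training windows will still cross a seam.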
the run is cancelled. kaggle quota resets next week. when it does, i'll re-run the data pipeline with the fixed code, upload new train.bin and val.bin, and start pre-training from step 0.
the checkpoints from the cancelled run are saved — step_047500, step_050000, step_052500. i'm keeping them as a baseline. if you ever want to see what a model trained entirely on scrambled tokens looks like, those files exist.
one thing i'll say: the loss curve looked completely normal. flat for 30,000 steps, which is suspicious, but not obviously broken. if i hadn't dug into the data pipeline code looking for explanations, i would have run all 100,000 steps, gotten a model that can't generate coherent sentences, and spent another week trying to figure out why SFT wasn't working. catching it at 53k was better than catching it at 100k.
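a check that would have caught this before step 1, sketched under assumptions: take a window from the merged file and ask what fraction of its adjacent token pairs also occur as adjacent pairs in the source. bigram_overlap is a hypothetical helper, not something in the pipeline, and the integer lists below are stand-ins for real token IDs.

```python
import random

def bigram_overlap(source_tokens, candidate_tokens):
    """fraction of adjacent pairs in candidate that also appear in source."""
    source_bigrams = set(zip(source_tokens, source_tokens[1:]))
    candidate_bigrams = list(zip(candidate_tokens, candidate_tokens[1:]))
    if not candidate_bigrams:
        return 0.0
    hits = sum(1 for pair in candidate_bigrams if pair in source_bigrams)
    return hits / len(candidate_bigrams)

source = list(range(1000))   # stand-in for a source token stream
intact = source[100:200]     # a window copied in order
scrambled = intact[:]
random.Random(0).shuffle(scrambled)  # same tokens, order destroyed

print(bigram_overlap(source, intact))     # 1.0
print(bigram_overlap(source, scrambled))
```

an intact window scores 1.0 by construction. a scrambled one should score far lower, though on real text the gap will be smaller than in this toy because common pairs turn up by chance.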
next run will be different. that's the only thing i know for certain.