2026-06-20

why i went from 500M to 100M parameters (it's not what you think)

i had a 500M parameter indonesian language model called DFD-1. it was technically running. val loss was 3.2 at the halfway point — not great, but not dead either. i could have kept going.

i didn't. not because it failed. because of two things: the data was noisy, and the iteration cycle was too slow.

the noisy data problem: i used web crawl data without proper cleaning. the model learned patterns from garbage — navigation menus, repeated headers, short spam lines. output started showing severe repetition. the kind where it just loops the same phrase until you kill the generation. fixable, but you'd need to retrain from scratch with clean data anyway.

the iteration speed problem: 500M params on Kaggle free tier takes roughly 3 weeks for a full training run. that means if something's wrong (and something is always wrong), you find out 3 weeks later. 100M cuts that to about 1 week. that's three times faster to find out whether your hypothesis was right.

so the tradeoff was: continue a slow loop with a model that needs a data fix anyway, or restart smaller, cleaner, faster. i restarted. IDK-1 is 100M parameters. same LLaMA-style architecture — GQA, SwiGLU, RMSNorm, RoPE theta=500k, logit soft-capping from Gemma 2. just smaller and trained on data i actually cleaned this time.
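for anyone who hasn't met logit soft-capping: as far as i understand it, the idea is to squash logits through a scaled tanh so they can never exceed a fixed magnitude. below is a minimal pure-python sketch of that formula, not IDK-1's actual code. the name `soft_cap` and the default `cap=30.0` are my own placeholders for illustration, and the cap is a hyperparameter you'd tune.

```python
import math


def soft_cap(logits, cap=30.0):
    """Squash each logit smoothly into [-cap, cap] via cap * tanh(x / cap).

    `cap` is an illustrative default, not a value taken from IDK-1.
    """
    return [cap * math.tanh(x / cap) for x in logits]


# small logits pass through almost unchanged, huge ones get clamped smoothly
capped = soft_cap([0.0, 1.0, 1000.0, -1000.0])
```

small values stay close to the identity, so normal predictions are barely touched. only outlier logits get pulled back toward the cap.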

aggressive cleaning means four things: filtering navigation noise, removing short-line spam, detecting repetition patterns, and deduplicating by MD5 hash. Wikipedia ID kept 56% of its docs after cleaning. the pipeline is slower to run, but the model isn't learning from menus and breadcrumbs.
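the four cleaning steps can be sketched roughly like this. it's a simplified illustration, not the real pipeline: `NAV_WORDS`, `clean_doc`, `dedup`, and every threshold below are hypothetical names and numbers picked for the example.

```python
import hashlib
from collections import Counter

# hypothetical list of navigation words; a real list would be much longer
NAV_WORDS = {"home", "menu", "login", "search", "beranda", "masuk", "cari"}


def clean_doc(text, min_words=4, max_nav_ratio=0.5, max_repeat_ratio=0.3):
    """Return the cleaned text, or None if the document should be dropped."""
    kept = []
    for line in text.splitlines():
        words = line.split()
        if len(words) < min_words:
            continue  # short-line spam, menu items, stray headers
        nav_hits = sum(w.lower() in NAV_WORDS for w in words)
        if nav_hits / len(words) > max_nav_ratio:
            continue  # line is mostly navigation words
        kept.append(line.strip())
    if not kept:
        return None
    # repetition check: share of kept lines that duplicate an earlier line
    counts = Counter(kept)
    duplicates = sum(c - 1 for c in counts.values())
    if duplicates / len(kept) > max_repeat_ratio:
        return None
    return "\n".join(kept)


def dedup(docs):
    """Drop exact duplicates by MD5 hash, keeping the first occurrence."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

usage is just `dedup([d for d in map(clean_doc, raw_docs) if d is not None])`, where `raw_docs` is whatever list of strings you crawled. note that MD5 only catches exact duplicates; near-duplicates would need something fuzzier.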

DFD-1 still exists. the checkpoint is saved. if i ever want to continue it, i can. but right now IDK-1 finishes a full cycle before DFD-1 finishes half. that's the only math that matters when you're on free compute.
