i trained the first sundanese language model from scratch and it took 83 minutes
sundanese has about 40 million native speakers. it's the second most spoken language in indonesia. and until today, there was no dedicated language model trained on it. not a fine-tune. not a small one. nothing.
ajnyana-1 is the first one.
the name is from old sundanese — it means wisdom or knowledge. i liked the contrast with IDK-1 (I Don't Know). there's an arc there: IDK-1 is the big indonesian model, the ambitious one. ajnyana is smaller, focused, from a place of genuine curiosity about what even happens when you train a tiny model on a single regional language.
the data problem was real. sundanese NLP barely exists as a field. the two sources i could find were Wikipedia Sundanese (30k articles, clean, formal) and CC-100 Sundanese (238k docs, noisy, lots of SEO spam and code-switching with Indonesian). after QC filtering: 266k documents, 61M words, roughly 122M tokens after BPE encoding. not a lot. by the usual chinchilla rule of thumb (about 20 tokens per parameter), a 9M param model wants about 180M tokens. the corpus has 68% of that.
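the token budget arithmetic, as a quick sketch. the 20 tokens-per-parameter figure is the common rule-of-thumb reading of chinchilla, not an exact law, and the function name is just for illustration.

```python
# rule-of-thumb reading of chinchilla: ~20 training tokens per parameter.
# this is an approximation, not an exact scaling law.
TOKENS_PER_PARAM = 20


def chinchilla_budget(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Rough compute-optimal token count for a model with n_params parameters."""
    return n_params * tokens_per_param


n_params = 9_000_000          # ~9M parameter model
corpus_tokens = 122_000_000   # ~122M tokens after BPE encoding

budget = chinchilla_budget(n_params)
coverage = corpus_tokens / budget

print(f"budget: {budget / 1e6:.0f}M tokens, corpus covers {coverage:.0%}")
```

the corpus size is the binding constraint here, not the model size.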
the tokenizer is byte-level BPE with a 16K vocab, trained specifically on the sundanese corpus. byte-level means no UNK tokens. the 2.00 tokens-per-word ratio (122M tokens over 61M words) tells you the vocab fits the language reasonably well: it's not stretching every word across 5 subwords.
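tokens-per-word is easy to measure for any tokenizer. a minimal sketch: `tokens_per_word` and `byte_encode` are hypothetical helpers, and the byte encoder is only a stand-in so the snippet runs without the real tokenizer. swap in the actual encode function to reproduce the corpus number.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder: one token per UTF-8 byte (worst case for a byte-level vocab)."""
    return list(text.encode("utf-8"))


# corpus-level ratio from the counts reported above
corpus_ratio = 122_000_000 / 61_000_000
print(f"corpus tokens per word: {corpus_ratio:.2f}")

# toy input: 6 bytes over 2 words
print(tokens_per_word(["abc de"], byte_encode))
```

with raw bytes as tokens the ratio would sit near the average word length in bytes, so 2.00 means the learned merges are doing most of the work.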
the architecture is nanoGPT. karpathy's implementation. standard multi-head attention, pre-norm layernorm, GELU, learned positional embeddings, weight tying between the embedding and lm_head. 6 layers, 4 heads, dim 256. 8.95M parameters total. nothing fancy — the whole point was to start simple and see what the data actually teaches the model.
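a rough sanity check on the 8.95M figure. this sketch assumes a vocab of exactly 16,000 and no bias terms in the linear and layernorm layers; neither is pinned down above, so treat both as assumptions. `gpt_param_count` is a hypothetical helper, not part of nanoGPT. note that the head count doesn't change the total.

```python
def gpt_param_count(vocab_size: int, block_size: int, n_layer: int,
                    d_model: int, bias: bool = False) -> int:
    """Parameter count for a GPT-2 style decoder with tied embedding / lm_head."""
    tok_emb = vocab_size * d_model   # shared with lm_head via weight tying
    pos_emb = block_size * d_model   # learned positional embeddings

    # attention: fused qkv projection (d -> 3d) plus output projection (d -> d)
    attn = d_model * 3 * d_model + d_model * d_model
    # MLP: expand (d -> 4d) then project back (4d -> d)
    mlp = d_model * 4 * d_model + 4 * d_model * d_model
    # two layernorms per block, weight vector only
    ln = 2 * d_model
    per_block = attn + mlp + ln

    if bias:
        # biases for qkv, attn out, mlp expand, mlp project, and both layernorms
        per_block += 3 * d_model + d_model + 4 * d_model + d_model + 2 * d_model

    final_ln = d_model * (2 if bias else 1)
    return tok_emb + pos_emb + n_layer * per_block + final_ln


# assumed: vocab_size=16_000 (the post says "16K"), bias=False
total = gpt_param_count(vocab_size=16_000, block_size=512, n_layer=6, d_model=256)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

under those assumptions nearly half the parameters sit in the token embedding table, which is typical for a model this small.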
training: 10,000 steps on a Kaggle T4, AdamW with cosine LR decay from 1e-3 to 1e-4, 500 step warmup, batch size 64, context 512. 83 minutes total. final val loss: 3.7538. perplexity: 52.20.
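the schedule and the token throughput, sketched out. this is written in the style of a warmup-plus-cosine schedule, not copied from the training notebook, and `lr_at` is a hypothetical name. it also assumes batch size 64 means 64 sequences of 512 tokens per optimizer step with no gradient accumulation.

```python
import math

MAX_LR = 1e-3
MIN_LR = 1e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 10_000


def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


# tokens processed over the run (assumes no gradient accumulation)
tokens_per_step = 64 * 512
tokens_seen = tokens_per_step * TOTAL_STEPS
passes_over_corpus = tokens_seen / 122_000_000

# per-token perplexity implied by a natural-log cross-entropy loss
val_loss = 3.7538
implied_ppl = math.exp(val_loss)

print(f"lr at step 0 / 500 / 10000: {lr_at(0):.1e} / {lr_at(500):.1e} / {lr_at(10_000):.1e}")
print(f"tokens seen: {tokens_seen / 1e6:.1f}M (~{passes_over_corpus:.1f} passes over the corpus)")
print(f"exp(val_loss) = {implied_ppl:.2f}")
```

two things fall out of this. under those assumptions the run processes about 328M tokens, so the model sees the 122M-token corpus a bit under three times. and exp(3.7538) is about 42.7, so the 52.20 perplexity above doesn't follow from the reported val loss under the usual perplexity = exp(loss) definition; treat the two as separately reported numbers.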
what does it actually generate? syntactically valid sundanese. the outputs look like sundanese sentences — word order, particles, morphology roughly right. semantically? incoherent. it strings together plausible phrases that don't mean anything together. that's expected at 9M params and 122M tokens. the model learned the surface structure of the language. it hasn't learned to think in it.
but here's the thing: that's already more than existed before. the corpus is on huggingface. the model is on huggingface. the training notebooks are on github. if someone wants to fine-tune it, extend the corpus, or use it as a baseline — it's there.
what comes next for ajnyana: probably nothing immediately. this was a side project to the main IDK-1 work. but the path forward is obvious — more data (the CC-100 quality filter needs work), a bigger model (30M params would be the next step), and eventually instruction tuning on sundanese text. 40 million speakers deserve better than zero.