What we've built and published.
Research.
Dataset + PaperSubmitted
PRENA
Indonesian Digital PR Dataset for Nano Language Model Training
1,966 Q&A pairs derived from real and synthetic Indonesian government press releases. Trained a 10.6M parameter nano language model using ChatML format. Targets Scopus-indexed publication.
NLPIndonesianDatasetFine-tuning
CorpusLive
Cleanesia
Clean Indonesian Text Corpus
70,587 documents and 36.5M words of clean Indonesian text across legal, encyclopedia, news, and web domains. Built because most existing Indonesian datasets are full of navigation noise and boilerplate.
CorpusIndonesianPre-trainingOpen Data