What we've built and published.

Research.

Dataset + PaperSubmitted

PRENA

Indonesian Digital PR Dataset for Nano Language Model Training

1,966 Q&A pairs derived from real and synthetic Indonesian government press releases. Trained a 10.6M parameter nano language model using ChatML format. Targets Scopus-indexed publication.

NLPIndonesianDatasetFine-tuning
CorpusLive

Cleanesia

Clean Indonesian Text Corpus

70,587 documents and 36.5M words of clean Indonesian text across legal, encyclopedia, news, and web domains. Built because most existing Indonesian datasets are full of navigation noise and boilerplate.

CorpusIndonesianPre-trainingOpen Data