2026-06-23

i built a clean indonesian text corpus because every existing one is broken

when i was debugging why IDK-1's loss was plateauing, i looked at the CulturaX data for the first time.

it was bad. not 'slightly noisy' bad. navigation menus, breadcrumbs, cookie consent text, repeated headers — all treated as training data. the model wasn't learning indonesian language. it was learning the structure of a poorly-scraped website. then i found the shuffle bug on top of that and cancelled the whole run.

but even after fixing the shuffle bug, the underlying problem remains: the indonesian NLP datasets that exist are either low quality, behind a license wall, or hosted in a way that makes them annoying to actually use. i needed clean indonesian text. i couldn't find a corpus i trusted. so i built one.

cleanesia. the name is clean + indonesia. simple.

the stack: MongoDB Atlas for storage, Vercel serverless for the API, Python scripts for importing and cleaning. phase 1 sources: JDIH legal documents (from the same corpus i built for Nala) and Wikipedia ID. legal documents because they're formal, well-structured indonesian. Wikipedia because it's broad coverage.

the numbers right now: 8,993 documents. three domains — 8,000 Wikipedia ID articles, 732 press releases from PRENA, 261 legal documents from JDIH. about 9 million words total.

the API is live at api.deflated.xyz. four endpoints: /api/info for stats, /api/sample for random documents, /api/search for full-text search, /api/download for HuggingFace links. everything is GET, no auth, CC BY 4.0.
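
a minimal sketch of calling those endpoints from python, standard library only. the endpoint paths are the four listed above. everything else is an assumption for illustration: the https scheme, the `q` parameter name for search, and that responses come back as JSON. check /api/info or the docs for the real parameter names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# assumption: the API is served over https at this host
BASE_URL = "https://api.deflated.xyz"

# the four GET endpoints named in the post
ENDPOINTS = ("/api/info", "/api/sample", "/api/search", "/api/download")


def build_url(endpoint, **params):
    """Build a full request URL; query parameter names are hypothetical."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    query = urlencode(params)
    return f"{BASE_URL}{endpoint}" + (f"?{query}" if query else "")


def get_json(endpoint, **params):
    """GET an endpoint and decode the body, assuming it is JSON."""
    with urlopen(build_url(endpoint, **params), timeout=10) as resp:
        return json.load(resp)
```

usage would look something like `get_json("/api/info")` for stats or `get_json("/api/search", q="hukum")` for a search, with `q` being a guess at the parameter name. no auth headers needed since everything is open.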

the cleaning is real. each document goes through navigation noise detection (regex patterns for breadcrumbs and copyright lines), minimum length filtering, and MD5 deduplication. Wikipedia kept about 56% of articles after filtering. the legal documents were already clean — JDIH formats are consistent enough that the main work was parsing the document headers and removing page markers.
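
the three steps above can be sketched roughly like this. this is not the actual cleanesia code: the regex patterns, the `min_words` threshold, and the function names are all assumptions made up for illustration. only the overall shape (noise regexes, minimum length filter, MD5 dedup) comes from the pipeline described above.

```python
import hashlib
import re

# hypothetical noise patterns; the real ones are not shown in this post
NOISE_PATTERNS = [
    # breadcrumb-style lines such as "Beranda > Berita > Nasional"
    re.compile(r"^\s*[^>»\n]+(?:\s*[>»]\s*[^>»\n]+){2,}\s*$"),
    # copyright lines such as "© 2024 ..." or "Hak Cipta ..."
    re.compile(r"^\s*(?:©|\(c\)|copyright|hak cipta)", re.IGNORECASE),
]


def is_noise(line):
    """True if the line matches any navigation/copyright pattern."""
    return any(p.match(line) for p in NOISE_PATTERNS)


def clean_document(text, min_words=50):
    """Drop noisy and empty lines; return None if too little text remains.

    min_words=50 is an arbitrary placeholder threshold.
    """
    kept = [ln.strip() for ln in text.splitlines()
            if ln.strip() and not is_noise(ln)]
    cleaned = "\n".join(kept)
    return cleaned if len(cleaned.split()) >= min_words else None


def dedupe(docs):
    """Keep the first occurrence of each document, keyed by MD5 of its text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

note that MD5 over the raw text only catches exact duplicates. near-duplicates (same article with a different footer, say) would need something like normalization before hashing or a fuzzy method, which this sketch does not attempt.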

what i actually want from this: for IDK-1 training, i need clean data where the token sequences represent real sentences. the shuffle bug aside, the main lesson from the cancelled run is that garbage in means garbage out, and you won't always see it in the loss curve until it's too late. cleanesia is the data i'll use for the next IDK-1 run — alongside cleaning the CulturaX pipeline properly.

phase 2 is news data — open sources like Common Crawl Indonesian subset and CC-100. phase 3 is forum text (informal indonesian) and literature. the goal is domain coverage — not just formal indonesian, but how indonesians actually write.

if you're building something in the indonesian NLP space, the API and dataset are free to use. that's the point.
