Dataset
Cleanesia
Clean Indonesian text corpus. Open, free, and built for training language models — not scraped from navigation bars and cookie banners.
70.6K documents
36.5M words
47.4M est. tokens
Domains
Web: 61,594 docs
Encyclopedia: 8,000 docs
News: 732 docs
Legal: 261 docs
Forum: coming soon
Literature: coming soon
API
Base URL: https://api.deflated.xyz
GET /api/info
Corpus statistics: total docs, words, tokens, and breakdown by domain.
GET /api/sample
Random documents. Optional: ?domain=legal&n=5
GET /api/search
Full-text search. Required: ?q=peraturan. Optional: &limit=10
GET /api/download
HuggingFace dataset links for bulk download.
Try it
curl "https://api.deflated.xyz/api/sample?n=1"
curl "https://api.deflated.xyz/api/search?q=peraturan+daerah&limit=5"
curl "https://api.deflated.xyz/api/info"
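The same requests can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_url` and `fetch` are hypothetical, and the code assumes the API returns JSON, which this page does not state.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.deflated.xyz"  # base URL listed in the API section


def build_url(endpoint, **params):
    """Return the full URL for an endpoint, URL-encoding query parameters."""
    # Skip parameters left as None so optional ones can simply be omitted.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/api/{endpoint}"
    return f"{url}?{query}" if query else url


def fetch(endpoint, **params):
    """GET an endpoint and decode the body.

    ASSUMPTION: the response body is JSON; adjust if the API returns
    another format.
    """
    with urlopen(build_url(endpoint, **params), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Print the request URLs without touching the network.
    print(build_url("info"))
    print(build_url("sample", domain="legal", n=5))
    print(build_url("search", q="peraturan daerah", limit=5))
```

Calling `fetch("search", q="peraturan daerah", limit=5)` would issue the second curl request above. `urlencode` encodes the space in the query as `+`, matching the hand-written URL.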
Download
The full dataset will be available on HuggingFace under a CC BY 4.0 license.
The HuggingFace upload is in progress; check back soon.
Built under Deflated. Part of the Indonesian AI stack — alongside IDK-1 and Nala.