Dataset

Cleanesia

A clean Indonesian text corpus. Open, free, and built for training language models, not filled with text scraped from navigation bars and cookie banners.

70.6K documents
36.5M words
47.4M est. tokens

Domains

Web: 61,594 docs
Encyclopedia: 8,000 docs
News: 732 docs
Legal: 261 docs
Forum: coming soon
Literature: coming soon
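
As a quick consistency check on the figures above, the sketch below sums the per-domain document counts and derives two ratios from the headline numbers. The counts are copied from this page; the variable names are illustrative, and the ratios are only approximate because the word and token totals are rounded (and the page does not say which tokenizer the token estimate assumes).

```python
# Figures copied from the page; "coming soon" domains are left out.
DOMAIN_DOCS = {"Web": 61_594, "Encyclopedia": 8_000, "News": 732, "Legal": 261}
WORDS = 36.5e6       # "36.5M words" (rounded on the page)
EST_TOKENS = 47.4e6  # "47.4M est. tokens" (rounded on the page)

total_docs = sum(DOMAIN_DOCS.values())
print(f"total documents: {total_docs:,}")  # 70,587, shown on the page as 70.6K

# Share of the corpus contributed by each domain.
for domain, docs in DOMAIN_DOCS.items():
    print(f"{domain:<13}{docs:>7,}  {docs / total_docs:6.1%}")

# Derived ratios; approximate, since the inputs are rounded.
words_per_doc = WORDS / total_docs
tokens_per_word = EST_TOKENS / WORDS
print(f"avg words per document: {words_per_doc:.0f}")
print(f"est. tokens per word:   {tokens_per_word:.2f}")
```

The four listed domains add up to 70,587 documents, which matches the rounded 70.6K total, with the Web domain making up the large majority of the corpus.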

API

Base URL: https://api.deflated.xyz

GET /api/info

Corpus statistics: total documents, words, and tokens, plus a breakdown by domain.

GET /api/sample

Returns random documents. Optional query parameters: domain and n, for example ?domain=legal&n=5

GET /api/search

Full-text search. Required: q, for example ?q=peraturan. Optional: limit, for example &limit=10

GET /api/download

HuggingFace dataset links for bulk download.

Try it
curl "https://api.deflated.xyz/api/sample?n=1"

curl "https://api.deflated.xyz/api/search?q=peraturan+daerah&limit=5"

curl "https://api.deflated.xyz/api/info"
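
The same requests can be made from Python. The following is a minimal sketch using only the standard library, not an official client: `build_url` and `fetch` are hypothetical helper names, and decoding the response as JSON is an assumption, since this page does not state the response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.deflated.xyz"  # base URL as listed on this page


def build_url(endpoint, **params):
    """Return the full URL for an endpoint such as 'info', 'sample' or 'search'."""
    # Drop parameters left as None so optional ones can simply be omitted.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/api/{endpoint}"
    return f"{url}?{query}" if query else url


def fetch(endpoint, timeout=10, **params):
    """GET an endpoint and decode the body.

    ASSUMPTION: the API responds with UTF-8 encoded JSON.
    """
    with urlopen(build_url(endpoint, **params), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For instance, `build_url("search", q="peraturan daerah", limit=5)` produces the same URL as the second curl command above, and `fetch("sample", domain="legal", n=5)` would request five random legal documents, provided the JSON assumption holds.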
Download

The full dataset will be available on HuggingFace. License: CC BY 4.0.

The HuggingFace upload is in progress; check back soon.

Built under Deflated. Part of the Indonesian AI stack — alongside IDK-1 and Nala.