I’ve been thinking about adding this to my “Fuck it, I’ll do it myself” / SHTF pile. I have a spare 10-15GB for a good selection of basic articles (across sciences, history, pop culture trivia, etc.).
https://get.kiwix.org/en/solutions/hotspots/content-bundles/
https://get.kiwix.org/en/solutions/hotspots/imager-service/
There’s something inherently cool about having Wikipedia in a box (yes, you’d likely need to refresh it once a year), but I’ve never heard of anyone actually self-hosting a Kiwix instance.


Do you actually train the LLM or use RAG? I have been looking for a local LLM + Wikipedia RAG solution for a while now.
For now I just have kiwix-serve + searxng doing a simple search but the Kiwix search is…questionable.
Somewhere in my documents, I have a scoped ticket for how to use Kiwix as the source the LLM pulls information from directly, so it can populate its answer and respond to the question naturally, without word-vomiting a complete wiki entry. I can dig that up for you; it’s actually why I’m looking at Kiwix (back burner project for now).
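For a rough idea of what that could look like, here is a minimal sketch (not the ticket itself). It assumes kiwix-serve is running at `http://localhost:8080`, that its `/search` endpoint accepts `books.name`, `pattern`, `pageLength` and `format=xml` and returns RSS-style `<item><link>` results (check this against your kiwix-serve version), and that the book is called `wikipedia_en_all`, which is a placeholder. All function names are made up for illustration. It only builds the prompt; feeding it to a local LLM is left out.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

KIWIX_URL = "http://localhost:8080"  # assumption: local kiwix-serve address
BOOK = "wikipedia_en_all"            # hypothetical book name, use your ZIM's name


def search_url(query, limit=3):
    """Build a kiwix-serve full-text search URL (parameter names are assumed)."""
    params = {"books.name": BOOK, "pattern": query,
              "pageLength": limit, "format": "xml"}
    return KIWIX_URL + "/search?" + urllib.parse.urlencode(params)


def result_links(rss_xml):
    """Pull article links out of an RSS-style result list (<item><link>...)."""
    root = ET.fromstring(rss_xml)
    links = (item.findtext("link") for item in root.iter("item"))
    return [link for link in links if link]


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(page_html, max_chars=2000):
    """Strip markup and truncate so the passage fits in a small context window."""
    parser = _TextExtractor()
    parser.feed(page_html)
    parser.close()
    return " ".join(parser.parts)[:max_chars]


def build_prompt(question, passages):
    """Ask for an answer in the model's own words, grounded in the passages."""
    sources = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return ("Answer the question in your own words using only the sources "
            "below. Do not paste the articles back.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:")


def fetch(url):
    """Download a page from kiwix-serve as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def retrieve(question, limit=3):
    """Search Kiwix, then fetch and clean the top matching articles."""
    links = result_links(fetch(search_url(question, limit)))
    return [html_to_text(fetch(urllib.parse.urljoin(KIWIX_URL, link)))
            for link in links]
```

With a server up, `build_prompt(q, retrieve(q))` gives you the text to hand to whatever local model you run. Truncating each article is the crude part; chunking plus embeddings would be the proper RAG step, but this keeps Kiwix as the only dependency.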
PS: You’re aware of what LLM-wiki is? That might suit your purposes better if your corpus is bespoke and regularly updated. Works nicely.
https://tinyurl.com/llmwiki