I'm building an LLM-based RAG QA bot for my company, a financial institution. So far I know the basic building blocks: prompt engineering, RAG, vector DBs, evals, etc. Funnily enough, the first challenge I've run into is curating and managing all the different types of docs, e.g.:
* email chains
* Teams recording transcripts
* Confluence pages
* PDF manuscripts<p>These are ever-evolving and may change through periodic delta updates, manual syncs, additions/removals, etc. I'm trying to figure out whether there's a proper way to manage these docs/texts. Basically, I think I need a system that stores the files and their metadata and provides a web UI for people to manage them. These blobs of text would then go through frameworks like LangChain/LlamaIndex to be cleaned, chunked, and loaded into a vector DB, so that different chunking strategies can be A/B tested while other people maintain the ever-growing doc collection.<p>Any suggestions are welcome. I've tried some all-in-one frameworks, but so far my experience has been lackluster. Also, due to compliance constraints, my company cannot use cloud-based solutions, so it has to be either open source and deployed locally, or developed in-house.
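
To make the question concrete, here is a rough sketch of the kind of thing I have in mind, not something I've settled on. It uses only Python's standard library (`sqlite3`, `hashlib`), so it runs entirely locally. The names `DocRegistry`, `upsert`, `chunk_fixed`, and `chunk_paragraphs` are made up for illustration, and I'm assuming a content hash is good enough to detect deltas. A real version would hand the chunks to LangChain/LlamaIndex and a vector DB instead of keeping them in SQLite.

```python
import hashlib
import sqlite3


def chunk_fixed(text, size=200, overlap=50):
    """Strategy A (hypothetical): fixed-size character windows with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def chunk_paragraphs(text):
    """Strategy B (hypothetical): split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Every strategy is applied to every document so they can be compared later.
STRATEGIES = {"fixed": chunk_fixed, "paragraph": chunk_paragraphs}


class DocRegistry:
    """Stores documents, their metadata, and per-strategy chunks in SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS docs ("
            "doc_id TEXT PRIMARY KEY, source TEXT, sha256 TEXT, body TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "doc_id TEXT, strategy TEXT, seq INTEGER, text TEXT)"
        )

    def upsert(self, doc_id, source, body):
        """Add or update a document; returns 'added', 'updated', or 'unchanged'."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        row = self.db.execute(
            "SELECT sha256 FROM docs WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row and row[0] == digest:
            return "unchanged"  # same content hash: skip re-chunking
        self.db.execute(
            "INSERT OR REPLACE INTO docs VALUES (?, ?, ?, ?)",
            (doc_id, source, digest, body),
        )
        self._rechunk(doc_id, body)
        self.db.commit()
        return "updated" if row else "added"

    def remove(self, doc_id):
        """Delete a document together with all of its chunks."""
        self.db.execute("DELETE FROM docs WHERE doc_id = ?", (doc_id,))
        self.db.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        self.db.commit()

    def _rechunk(self, doc_id, body):
        # Drop stale chunks, then re-create them under every strategy.
        self.db.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        for name, fn in STRATEGIES.items():
            for seq, text in enumerate(fn(body)):
                self.db.execute(
                    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
                    (doc_id, name, seq, text),
                )

    def chunks(self, doc_id, strategy):
        """Return the chunks of one document under one strategy, in order."""
        rows = self.db.execute(
            "SELECT text FROM chunks WHERE doc_id = ? AND strategy = ? ORDER BY seq",
            (doc_id, strategy),
        )
        return [r[0] for r in rows]
```

The idea is that a periodic sync job would call `upsert` for each Confluence page, transcript, etc., and only changed documents get re-chunked. Since chunks are tagged with their strategy name, an eval run could pull `chunks(doc_id, "fixed")` versus `chunks(doc_id, "paragraph")` and compare retrieval quality. What I don't have is the web UI and the connectors on top of this.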