I've got a big corpus of textual data (10M+ tokens) from our corporate wiki that I'd like to plug into an LLM that our customers can use. The trouble is, I don't know how best to do that.

- I can't really train an LLM myself. That's a huge lift.

- I can use an off-the-shelf model, like GPT-3.5-Turbo, and then use their fine-tuning API to improve the model, query by query. But that's not a great interface for incorporating a big block of semi-structured data.

- I could use RAG (Retrieval-Augmented Generation): basically a clever lookup algorithm that finds the right place in my textual dataset, loads it into the context window, and uses it for generation. But not all of my data lends itself cleanly to RAG.

I can use the OpenAI API to generate embeddings from my dataset, but I don't know how to then use them to augment a model, or otherwise use them for useful search and/or generation.

How are you guys plugging your large textual datasets into LLMs? Any advice would be much appreciated.
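For concreteness, my rough mental model of the embedding route is something like the sketch below (assuming the openai 1.x Python SDK and numpy; the model name and the naive character chunking are just placeholders). I'm not sure whether this is the right direction, or how it holds up at 10M+ tokens:

    # Rough sketch: embed wiki chunks once, then answer queries by cosine
    # similarity over the stored vectors. Assumes the openai 1.x Python SDK
    # and an OPENAI_API_KEY in the environment; the model name and the naive
    # character chunking are placeholders, not recommendations.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts: list[str]) -> np.ndarray:
        # Returns an (n, d) matrix of embeddings for a batch of texts.
        # A real pipeline would batch requests to stay under API limits.
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([item.embedding for item in resp.data])

    # Stand-in corpus; in practice this is every page exported from the wiki.
    wiki_pages = ["Page one text ...", "Page two text ..."]

    # Naive fixed-size character chunks; chunking by headings/sections
    # usually retrieves better for wiki-style content.
    chunks = [page[i:i + 2000] for page in wiki_pages for i in range(0, len(page), 2000)]
    chunk_vectors = embed(chunks)  # embed the corpus once and store the matrix

    def top_k(query: str, k: int = 5) -> list[str]:
        # Return the k chunks most similar to the query by cosine similarity.
        q = embed([query])[0]
        sims = chunk_vectors @ q / (
            np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
        )
        return [chunks[i] for i in np.argsort(-sims)[:k]]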
My general rule of thumb at the moment is:

- Tasks requiring knowledge synthesis across multiple datapoints (in your case, wiki pages) benefit from fine-tuning, so the model can do some basic chain of reasoning to reach a new conclusion. Often fine-tuning on the text itself, rather than on (query, text) pairs, is sufficient for basic memorization and therefore lookup. The con with this approach is you don't know where a given piece of information came from: the pretrained model, your wiki, or a hallucination.

- Tasks that require a source of truth for reliability benefit more from the RAG approach, since the summarization layer can explicitly reference the input sources. I use markdown annotations for the output format since they give you inline, easily parsable references to the retrieved content.

RAG is effectively a layer on top of classic information retrieval, like what happens with search engines. The retrieval itself could be semantic-embedding-based (like what you get back from OAI), tf-idf-based, or some other heuristic approximation.

If you have a few initial queries of what people are looking for, I'd start simple with a Jupyter notebook, the OAI fine-tuning API, and numpy before jumping to off-the-shelf solutions that promise to solve this problem for you. It will build more intuition about your data and the tradeoffs involved.
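To make the RAG half concrete, here's a rough sketch of the generation step, assuming the openai 1.x SDK and a top_k() retrieval function like the embedding lookup you sketched; the model name and prompt wording are just placeholders:

    # Rough sketch of the generation step: retrieved chunks are numbered in
    # the prompt and the model is asked to cite them inline in markdown, so
    # each claim can be traced back to a wiki page. Assumes the openai 1.x
    # SDK and a top_k() retrieval function like the one sketched above; the
    # model name and prompt wording are placeholders.
    from openai import OpenAI

    client = OpenAI()

    def answer(query: str) -> str:
        sources = top_k(query, k=5)
        context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(sources))
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Answer using only the numbered sources below. "
                            "Cite them inline as markdown references like [1]. "
                            "If the sources don't cover the question, say so.\n\n"
                            + context},
                {"role": "user", "content": query},
            ],
        )
        return resp.choices[0].message.content

Parsing the answer back is then just a matter of mapping each [n] marker to the chunk (and wiki page) it came from.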
check out https://klu.ai – we built it for this reason – sign up, book some time, and I'll help you however I can