> It seems that OpenAI has brief time periods during which their RAG system is able to work properly

This is the problem with running tests against a huge black box over the internet: are you even measuring properly? Who knows.

OpenAI has been under huge load lately, so it's possible (even likely) that whatever microservice the LLM was calling for retrieval was failing miserably and not properly reporting the failure.

I've seen this myself even with a single document: GPT failed to retrieve some obvious data and never clearly communicated that to me. It either told me something along the lines of "the info is not there" or spat out plausible-sounding output from LLM inference instead of the information that was actually in the PDF. After a retry (and a long wait) it worked. I can see how this effect would be exacerbated when retrieving from multiple files, which is probably implemented as multiple calls to the microservice handling retrieval, giving the process more chances to fail.
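To make the compounding concrete, here's a small sketch. The `fetch` callable and the 5% per-call failure rate are purely hypothetical stand-ins for whatever backend OpenAI actually runs; the point is that with N independent per-file calls, the chance of at least one failure is 1 - (1 - p)^N, and that a retry wrapper should surface failures rather than let the model improvise an answer:

```python
import random
import time

def retrieve_with_retry(fetch, query, attempts=3, base_delay=1.0):
    """Call a flaky retrieval backend, retrying with exponential backoff.

    `fetch` is a hypothetical stand-in for the microservice call that does
    the actual document lookup; it should raise on failure rather than
    silently returning an empty or made-up result.
    """
    for attempt in range(attempts):
        try:
            return fetch(query)
        except Exception:
            if attempt == attempts - 1:
                raise  # surface the failure instead of letting the LLM guess
            time.sleep(base_delay * 2 ** attempt + random.random())

# With N independent per-file retrieval calls, failures compound:
p = 0.05  # assumed per-call failure rate (illustrative)
for n in (1, 5, 20):
    print(n, round(1 - (1 - p) ** n, 2))  # 0.05, 0.23, 0.64
```

Even a modest per-call failure rate makes a 20-file retrieval fail most of the time, which would look exactly like the intermittent behavior described above.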
> I'm getting results that roughly match your findings

It's weird. We tend to think of the OpenAI people as programming gods now, but the knowledge retrieval API seems pretty obviously botched at the moment, since I can use third-party tools that pair GPT-4 with a different embedding database, feed them the same data, and get much better results.
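For reference, a pipeline like the one described (your own embedding store, GPT-4 only for the final answer) can be quite small. This is a minimal sketch, assuming the openai v1 Python SDK with OPENAI_API_KEY set; the model names, chunks, and prompt are illustrative, not what any particular third-party tool does:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# In a real tool these would come from chunking your documents.
chunks = ["...document chunk 1...", "...document chunk 2..."]
chunk_vecs = embed(chunks)

def answer(question, k=2):
    qv = embed([question])[0]
    # Cosine similarity against every stored chunk, done locally.
    sims = chunk_vecs @ qv / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(qv)
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The key difference from the built-in retrieval is that the search step is fully under your control, so when it returns nothing useful you can see that directly instead of getting a confabulated answer.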