Prototyping an application with an LLM is a pretty straightforward task these days: try out a few prompts, paste in some context, and see if it seems to work.

After trying to build something more substantial (a script that takes a text description and attempts to scrape that info out of a collection of PDFs/websites), I realized there are a number of annoyances with getting this out of the prototype stage:

* Parsing prompt responses to ensure they match my expected schema (e.g. I want XPATH selectors, but sometimes the model hallucinates a DOM id) -- a rough sketch of what I mean is below

* Hacks to avoid long context windows (especially if the context isn't easily vector-searchable, e.g. a DOM tree)

* Retry logic

* Measuring how well the system is doing over multiple examples

AI Twitter is full of examples of how LLMs, AutoGPT, etc. are cure-alls, but what are some of the practical issues that actually come up when you try to build on top of these yourself?
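For the first bullet, here's a sketch of the kind of glue code I mean, using pydantic and the pre-1.0 openai Python client. The Selector fields, prompt handling, and retry counts are just placeholders for my scraping case, not a recommendation:

```python
import json

import openai
from pydantic import BaseModel, ValidationError


class Selector(BaseModel):
    """Expected shape of each item the model should return."""
    field: str  # e.g. "price" (placeholder field name)
    xpath: str  # an XPATH expression; a bare DOM id won't satisfy the caller


def ask(prompt: str) -> str:
    # Pre-1.0 openai client; swap in whatever model/client you use.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]


def extract(prompt: str, max_tries: int = 3) -> list[Selector]:
    """Re-prompt with the validation error until the reply parses, or give up."""
    for _ in range(max_tries):
        raw = ask(prompt)
        try:
            return [Selector(**item) for item in json.loads(raw)]
        except (json.JSONDecodeError, TypeError, ValidationError) as err:
            # Feed the error back so the model can self-correct on the next try.
            prompt += f"\n\nYour last reply failed validation ({err}). Return only a JSON array."
    raise RuntimeError("model never produced a schema-conforming reply")
```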
One practical problem with the OpenAI API is that you'll get 'server busy' responses pretty regularly, making your app less reliable unless you wrap every call in retry logic (rough sketch at the end of this comment).

Another problem is that it will typically not ask clarifying questions, so it can make a wrong assumption about ambiguous wording or missing information in your prompt without telling you.

It also seems to have an inexplicable penchant for certain outputs: for example, I asked GPT-4 to rate a myriad of things 1-10 for relevance, and a third of its ratings were exactly 7.8.

If you're dealing with vector-based context, there are even more issues: the embeddings fail on negations and don't know about newer terms. For example, "find me all the competitors to Pinecone" would not give you good results, because the embeddings model doesn't know what Pinecone is, so the query embedding isn't similar to an actual competitor like Milvus.
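For the 'server busy' problem, the usual workaround is exponential backoff with jitter around every call. A minimal sketch against the pre-1.0 openai Python client (the exception class, model name, and retry count here are assumptions; adjust for your client version):

```python
import random
import time

import openai


def chat_with_backoff(messages, model="gpt-3.5-turbo", max_retries=5):
    """Retry transient failures (server busy, rate limits) with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.OpenAIError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to the caller
            # Sleep 1s, 2s, 4s, ... plus jitter so concurrent workers don't retry in lockstep.
            time.sleep(2 ** attempt + random.random())


# Usage:
# reply = chat_with_backoff([{"role": "user", "content": "Rate this 1-10 for relevance: ..."}])
```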