It's crazy how fast this field moves! I basically live on HN, and I've never even heard of 30-40% of these terms and metrics (or maybe I just glossed over them in the past).<p>I love articles like these and how they bring me up to speed (at least to some degree) on the "new paradigm" that is AI/LLM.<p>As a coder I can't say what the future will look like (binary views), but I can easily believe that in the future we will have MORE AI/LLM, not LESS, so getting up to speed (at least on the acronyms and core theory and concepts) is well worthwhile.<p>Very good article!
> 65 min read<p>That's when you know it's going to be amazing. This is the single best narrative-form overview I've read so far of the current state of integrating LLMs into applications and the challenges encountered. It's fantastic and must have required an incredible amount of work. Massive kudos to the author.
I'm starting to see a lot of products in "beta" that seem to be little more than a very thin wrapper around ChatGPT. So thin that it is trivial to get it to give general responses.<p>I recently trialed an AI Therapy Assistant service. If I stayed on topic, then it stayed on topic. If I asked it to generate poems or code samples, it happily did that too.<p>It felt like they rushed it out without even considering that someone might ask it non-therapy related questions.
Evals are not suitable for evaluating LLM applications such as RAG, because you have to evaluate on your own data, where no golden test data exists, and the techniques used have poor correlation with human judgement.
We built the RAGAS framework for this: <a href="https://github.com/explodinggradients/ragas">https://github.com/explodinggradients/ragas</a>
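To make "evaluating without golden test data" concrete: reference-free metrics score the generated answer against the retrieved context rather than against a human-written label. RAGAS-style faithfulness does this with an LLM judge; the toy sketch below (my own illustration, not the RAGAS API) substitutes plain token overlap so it stays runnable.

```python
# Toy reference-free "faithfulness" check: score how well each answer
# sentence is supported by the retrieved context, with no golden answer
# needed. Real frameworks use an LLM judge instead of token overlap.

def faithfulness(answer, context, threshold=0.5):
    """Fraction of answer sentences whose words mostly appear in context."""
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    supported = sum(
        1 for s in sentences
        if len(set(s.lower().split()) & ctx_words) / max(len(s.split()), 1) >= threshold
    )
    return supported / max(len(sentences), 1)

context = "Refunds are available within 30 days of purchase."
good = "Refunds are available within 30 days."
bad = "Refunds are available for a full year."
print(faithfulness(good, context), faithfulness(bad, context))
```

The important design point is that the metric only needs the (question, retrieved context, answer) triple your app already produces, so it can run over live traffic.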
For those who don't have 65 minutes: if you write software you are probably familiar with the concepts of evals, caching, guardrails, defensive UX, and collecting user feedback, none of which are really unique to LLMs. The other two items are "fine-tuning", which just means nudging the LLM to be better at responding a certain way, and "RAG", a new acronym that just means using the input to look things up in a database first and concatenating the results into the prompt so the LLM uses them as part of the context for token generation.
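The RAG summary above fits in a few lines of code. Here's a minimal sketch; the toy keyword matcher stands in for a real search index or vector store, and the assembled prompt is what you'd hand to your chat-completion API of choice.

```python
# Minimal RAG sketch: retrieve relevant documents, then stuff them
# into the prompt as context for the LLM.

DOCS = [
    "The refund window is 30 days from purchase.",
    "Support is available Monday through Friday.",
    "Premium plans include priority support.",
]

def retrieve(query, docs, k=2):
    """Rank docs by naive keyword overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query, docs):
    """Concatenate retrieved context into the prompt, per the RAG pattern."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("What is the refund window?", DOCS))
```

Swap the toy retriever for BM25 or embedding search and that's the whole pattern: the model never needs fine-tuning to answer over your data, it just reads it from the prompt.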
Good notes on design patterns for an LLM-based product. The big question is whether we'll see frameworks evolve to tackle the hard parts here.<p>Evals, RAG, and guardrails often require recursive calls to LLMs, or to other fine-tuned systems that are themselves based on LLMs.<p>I'd like to see LLMs condensed and bundled into smaller single-task models - much more beneficial than doing system design around general-purpose LLMs in applications.<p>Essentially, we're applying traditional system design patterns to using LLMs in apps.
This is fantastic! I found myself nodding along in many places. I've definitely found in practice that evals are critical to shipping LLM-based apps with confidence. I'm actually working on an open-source tool in this space: <a href="https://github.com/openpipe/openpipe">https://github.com/openpipe/openpipe</a>. Would love any feedback on ways to make it more useful. :)
"hybrid retrieval (traditional search index + embedding-based search) works better than either alone."<p>- any references for how this hybrid retrieval is done?
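One common answer to this question: run both retrievers and merge their ranked lists with Reciprocal Rank Fusion (RRF), a standard score-fusion technique (used, e.g., by several search engines for exactly this keyword + vector combination). Since BM25 scores and cosine similarities aren't directly comparable, RRF uses only each document's rank in each list. A small sketch with made-up doc ids:

```python
# Hybrid retrieval via Reciprocal Rank Fusion: each document's fused
# score is the sum of 1/(k + rank + 1) over every ranked list it
# appears in, so documents ranked well by BOTH systems rise to the top.

def rrf_merge(ranked_lists, k=60):
    """Merge ranked lists of doc ids; returns ids sorted by fused score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a BM25 index and a vector store:
keyword_hits = ["d3", "d1", "d7"]
embedding_hits = ["d1", "d5", "d3"]
print(rrf_merge([keyword_hits, embedding_hits]))  # d1 wins: top-2 in both lists
```

The constant k (60 is the conventional default) damps the influence of top ranks so a single first-place finish doesn't dominate documents that rank decently everywhere.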
I've been working on getting LLM-based features out in a production environment for the past few months. This article is absolute gold. Does a great job of capturing several learnings that I think a lot of us are dealing with in silos.
Most of these products are just trivial wrappers around the behemoths, wrappers whose creators either can't recognize or don't even use half the patterns rattled off here.<p>I'd be more interested in the sales and marketing patterns being employed to hawk the same rebranded wrappers over and over. Ultimately, that's what's really going to contribute most to the success of all these startups.
I'm sorry, but from a _practical_ standpoint it feels like mostly fluff. Someone was advertising today on an HN hiring post that they would create a basic chatbot for a specific set of documents for $15,000. This feels like the type of web page that person would use to confuse a client into thinking that was a fair price.<p>Practically speaking, the starting point should be things like the APIs (such as OpenAI's) or open-source frameworks and software - for example, llama_index
<a href="https://github.com/jerryjliu/llama_index">https://github.com/jerryjliu/llama_index</a>. You can use something like that or another GitHub repo built with it to create a customized chatbot application in a few minutes or a few days. (It should not take two weeks and $15,000).<p>It would be good to see something detailed that demonstrates an actual use case for fine tuning. Also, I don't believe that the academic tests are appropriate in that case. If you really were dead set on avoiding a leading edge closed LLM, and doing actual fine-tuning, you would want a person to look at the outputs and judge them in their specific context such as handling customer support requests for that system.
Oh God, the marketing of barely-understood tech-crafting recipes into new corporate jargon has turned into new acronyms to obfuscate the jargon, and is now accelerating even faster than AI.<p>Did you miss the NFT train? Have you ever asked yourself if this is what you should be doing with your life?<p>Just speaking as a guy who actually writes logic and code, rather than like, coming up with incantations and selling horseshit.