So I think this is an excellent post. Indeed, LLM maximalism is pretty dumb. They're awesome at specific things and mediocre at others. In particular, I get the most frustrated when I see people try to use them for tasks that need deterministic outputs *and where the thing you need to create is already known statically*. My hope is that it's just people being super excited by the tech.

I wanted to call this out, though, as it makes the case that to improve any component (and really make it production-worthy), you need an evaluation system:

> Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.

I think this makes sense from the perspective of a team with deeper ML expertise.

What it doesn't mention is that building one is an enormous effort, made even larger when you don't have existing ML expertise. I've been finding this out the hard way.

I've found that if you have "hard criteria" to evaluate (e.g., getting the LLM to produce a given structure rather than the open-ended output of a chat app), you can quantify improvements with observability tooling (SLOs!) and iterate in production: ship changes daily, version what you're shipping, and watch behavior over time (roughly the sketch at the end of this comment). It's arguably a lot less "clean", but it's much faster, and because it works on real-world usage data, it's really effective. An ML engineer might call that some form of "online testing", but I don't think the term quite fits.

At any rate, there are other use cases where you really do need evaluations. The more important correct output is, the more it's worth investing in evals. I would argue that if bad outputs have high consequences, then maybe LLMs aren't the right tech for the job in the first place, but that will probably change in a few years. Hopefully building evaluations will get easier too.
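
To make the "hard criteria" idea concrete, here's a minimal sketch of what I mean, assuming the LLM is asked to return JSON with a known shape. The field names, the validate_output helper, and the logging-based metric emission are all hypothetical; you'd wire the pass/fail signal into whatever observability stack you actually use.

    import json
    import logging
    import time

    logger = logging.getLogger("llm_output_slo")

    # Hypothetical expected structure for the LLM's JSON response.
    REQUIRED_FIELDS = {"title": str, "priority": int, "tags": list}

    def validate_output(raw: str) -> bool:
        """Return True if the response parses and matches the expected structure."""
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return False
        return all(
            isinstance(data.get(field), expected_type)
            for field, expected_type in REQUIRED_FIELDS.items()
        )

    def record_result(raw: str, prompt_version: str) -> bool:
        """Emit a pass/fail event; the observability tool aggregates these into an
        SLO, e.g. "99% of responses parse into the expected structure over 30 days"."""
        ok = validate_output(raw)
        logger.info(
            "llm_structured_output_check",
            extra={"ok": ok, "prompt_version": prompt_version, "ts": time.time()},
        )
        return ok

Tag each check with the prompt/template version you shipped, and the dashboard tells you whether yesterday's change actually moved the success rate.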