Lessons after a Half-billion GPT Tokens

512 points by lordofmoria about 1 year ago

41 comments

thisgoesnowhere about 1 year ago
The team I work on processes 5B+ tokens a month (and growing) and I'm the EM overseeing that.

Here are my takeaways:

1. There are way too many premature abstractions. Langchain, as one of many examples, might be useful in the future, but at the end of the day prompts are just an API call, and it's easier to write standard code that treats LLM calls as a flaky API call rather than as a special thing.

2. Hallucinations are definitely a big problem. Summarizing is pretty rock solid in my testing, but reasoning is really hard. Action models, where you ask the LLM to take in a user input and decide what to do next, are just really hard; specifically, it's hard to get the LLM to understand the context and to say when it's not sure.

That said, it's still a game changer that I can do it at all.

3. I am a bit more hyped than the author that this is a game changer, but like them, I don't think it's going to be the end of the world. There are some jobs that are going to be heavily impacted, and I think we are going to have a rough few years of bots astroturfing platforms. But all in all I think it's more of a force multiplier than a breakthrough like the internet.

IMHO it's similar to what happened to DevOps in the 2000s: you just don't need a big special team to help you deploy anymore; you hire a few specialists and mostly buy off-the-shelf solutions. Similarly, certain ML tasks are now easy to implement even for dumb-dumb web devs like me.

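A minimal sketch of the "treat LLM calls as a flaky API call" approach in plain code; the model name, retry policy, and helper name are illustrative assumptions, not the commenter's actual setup:

    import json
    import time

    from openai import OpenAI

    client = OpenAI()

    def call_llm_for_json(prompt: str, retries: int = 3) -> dict:
        """Treat the LLM like any other unreliable dependency: retry, back off, validate."""
        last_error = None
        for attempt in range(retries):
            try:
                response = client.chat.completions.create(
                    model="gpt-4-turbo",
                    messages=[{"role": "user", "content": prompt}],
                )
                # A malformed response is handled the same way as a transient failure.
                return json.loads(response.choices[0].message.content)
            except Exception as err:
                last_error = err
                time.sleep(2 ** attempt)  # simple exponential backoff
        raise RuntimeError(f"LLM call failed after {retries} attempts") from last_error
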
Xenoamorphous about 1 year ago
> We always extract json. We don't need JSON mode

I wonder why? It seems to work pretty well for me.

> Lesson 4: GPT is really bad at producing the null hypothesis

Tell me about it! Just yesterday I was testing a prompt around text modification rules that ended with “If none of the rules apply to the text, return the original text without any changes”.

Do you know ChatGPT’s response to a text where none of the rules applied?

“The original text without any changes”. Yes, the literal string.

CuriouslyC about 1 year ago
If you used better prompts you could use a less expensive model.

"Return nothing if you find nothing" is the level-0 version of giving the LLM an out. Give it a softer out ("in the event that you do not have sufficient information to make conclusive statements, you may hypothesize as long as you state clearly that you are doing so, and note the evidence and logical basis for your hypothesis"), then ask it to evaluate its own response at the end.

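A rough illustration of the "softer out" plus self-evaluation pattern described here; the two-pass structure, model name, and prompt stitching are assumptions for the sketch:

    from openai import OpenAI

    client = OpenAI()

    SOFT_OUT = (
        "In the event that you do not have sufficient information to make "
        "conclusive statements, you may hypothesize as long as you state clearly "
        "that you are doing so, and note the evidence and logical basis for your hypothesis."
    )

    def ask(prompt: str) -> str:
        return client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

    def answer_with_soft_out(question: str) -> str:
        draft = ask(f"{question}\n\n{SOFT_OUT}")
        # Second pass: have the model evaluate its own response, as suggested above.
        return ask(f"Review the answer below and flag any claims not backed by the stated evidence.\n\n{draft}")
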
ein0p about 1 year ago
Same here: I'm subscribed to all three top dogs in the LLM space, and routinely issue the same prompts to all three. It's very one-sided in favor of GPT-4, which is stunning since it's now a year old, although of course it has received a couple of updates in that time. Also, at least with my usage patterns, hallucinations are rare. In comparison, Claude will quite readily hallucinate plausible-looking APIs that don't exist when writing code, etc. GPT-4 is also more stubborn / less agreeable when it knows it's right. Very little of this is captured in metrics, so you can only see it from personal experience.

chromanoid about 1 year ago
GPT is very cool, but I strongly disagree with the interpretation in these two paragraphs:

> I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”

> Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

Natural language is the most probable output for GPT, because the text it was trained on is similar. In this case the developer simply leaned more into what GPT is good at rather than giving it more work.

You can use simple tasks to make GPT fail. Letter replacements, intentional typos and so on are very hard tasks for GPT. This is also true for ID mappings and the like, especially when the ID mapping diverges significantly from other mappings it may have been trained on (e.g. non-ISO country codes that resemble other three-letter codes, etc.).

The fascinating thing is that GPT "understands" mappings at all, which is the actual hint at higher-order pattern matching.

kromem about 1 year ago
Tip for your 'null' problem:

LLMs are set up to output tokens, not to not output tokens.

So instead of "don't return anything", have the lack of results "return the default value of XYZ", and then just do a text search on the result for that default value (i.e. XYZ), the same way you do the text search for the state names.

Also, system prompts can be very useful. It's basically your opportunity to have the LLM roleplay as X. I wish they'd let the system prompt be passed directly, but it's still better than nothing.

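A minimal sketch of the sentinel-value tip, assuming "XYZ" as the default value and the same substring search used for the state names; the prompt wording is illustrative:

    SENTINEL = "XYZ"

    PROMPT_TEMPLATE = (
        "Give me the full name of the US state this text pertains to. "
        f"If no state applies, return the default value {SENTINEL}.\n\nText: {{text}}"
    )

    def parse_state(raw_response: str) -> str | None:
        # The sentinel is detected the same way the state names are: a plain text search.
        if SENTINEL in raw_response:
            return None
        return raw_response.strip()
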
msp26 about 1 year ago
> But the problem is even worse – we often ask GPT to give us back a list of JSON objects. Nothing complicated mind you: think, an array list of json tasks, where each task has a name and a label.

> GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.

This is just a prompt issue. I've had it reliably return up to 200 items in the correct order. The trick is to not use lists at all but to have JSON keys like "item1": {...} in the output. You can use lists as the values here if you have some input with 0–n outputs.

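A sketch of the keyed-object trick, with a helper to turn the keyed output back into an ordered list; the schema and key naming are assumptions:

    # Shape requested from the model instead of a JSON array (illustrative):
    # {
    #   "item1": {"name": "File taxes", "label": "finance"},
    #   "item2": {"name": "Book flight", "label": "travel"},
    #   ...
    # }

    def keyed_to_list(keyed: dict) -> list[dict]:
        """Convert {"item1": {...}, "item2": {...}, ...} back into an ordered list."""
        keys = sorted(keyed, key=lambda k: int(k.removeprefix("item")))
        return [keyed[k] for k in keys]
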
Civitello about 1 year ago
> Every use case we have is essentially “Here’s a block of text, extract something from it.” As a rule, if you ask GPT to give you the names of companies mentioned in a block of text, it will not give you a random company (unless there are no companies in the text – there’s that null hypothesis problem!).

Make it two steps. First:

> Does this block of text mention a company?

If no, good, you've got your null result. If yes:

> Please list the names of companies in this block of text.

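A minimal two-step version of this suggestion; the prompts, model name, and yes/no check are assumptions:

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        return client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

    def extract_companies(text: str) -> list[str]:
        gate = ask(f"Does this block of text mention a company? Answer yes or no.\n\n{text}")
        if not gate.strip().lower().startswith("yes"):
            return []  # the null result comes from the cheap first step
        listing = ask(f"Please list the names of companies in this block of text, one per line.\n\n{text}")
        return [line.strip() for line in listing.splitlines() if line.strip()]
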
egonschiele about 1 year ago
I have a personal writing app that uses the OpenAI models and this post is bang on. One of my learnings relates to "Lesson 1: When it comes to prompts, less is more":

I was trying to build an intelligent search feature for my notes and asking ChatGPT to return structured JSON data. For example, I wanted to ask "give me all my notes that mention Haskell in the last 2 years that are marked as draft", and let ChatGPT figure out what to return. This only worked some of the time. Instead, I put my data in a SQLite database, sent ChatGPT the schema, and asked it to write a query to return what I wanted. That has worked much better.

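A rough sketch of that schema-to-SQL approach; the table layout, prompt, and model name are assumptions about what such an app might look like, not the commenter's code:

    import sqlite3

    from openai import OpenAI

    client = OpenAI()
    conn = sqlite3.connect("notes.db")

    SCHEMA = """
    CREATE TABLE notes (
        id INTEGER PRIMARY KEY,
        title TEXT,
        body TEXT,
        status TEXT,      -- e.g. 'draft' or 'published'
        created_at TEXT   -- ISO 8601 date
    );
    """

    def search_notes(question: str) -> list[tuple]:
        prompt = (
            f"Given this SQLite schema:\n{SCHEMA}\n"
            f"Write a single SELECT statement that answers: {question}\n"
            "Return only the SQL, with no markdown or explanation."
        )
        raw = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Strip any markdown fences the model adds despite instructions.
        sql = raw.strip().removeprefix("```sql").removeprefix("```").removesuffix("```").strip()
        # Model-written SQL deserves a read-only connection in a real app.
        return conn.execute(sql).fetchall()
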
swalsh about 1 year ago
The "being too precise reduces accuracy" example makes sense to me based on my crude understanding of how these things work.

If you pass in a whole list of states, you're kind of making the vectors for every state light up. If you just say "state" and the text you passed in has an explicit state, then fewer vectors specific to what you're searching for light up. So when it performs the softmax, the correct state is more likely to be selected.

Along the same lines, I think his \n vs. comma comparison probably comes down to tokenization differences.

trolan about 1 year ago
For a few uni/personal projects I noticed the same about Langchain: it's good at helping you use up tokens. The other use case, quickly switching between models, is still a very valid reason. However, I've recently started playing with OpenRouter, which seems to abstract the model nicely.

dougb5 about 1 year ago
The lessons I wanted from this article weren't in there: Did all of that expenditure actually help their product in a measurable way? Did customers use and appreciate the new features based on LLM summarization compared to whatever they were using before? I presume it's a net win or they wouldn't continue to use it, but more specifics around the application would be helpful.

FranklinMaillot about 1 year ago
In my limited experience, I came to the same conclusion regarding simple prompts being more effective than very detailed lists of instructions. But if you look at OpenAI's system prompt for GPT-4, it's an endless set of instructions with DOs and DON'Ts, so I'm confused. Surely they must know something about prompting their own model.

larodi about 1 year ago
I agree largely with the author, but this "wait for OpenAI to do it" sentiment is not valid. Opus, for example, is already much better (not only in my experience, but per researchers' evaluations). And even just for the fun of it, try some local inference. If you know how to prompt it, you could definitely run a local model for the same tasks.

Listening to my students all going to "call some API" for their projects is really very sad to hear. Many startup fellows share this sentiment, which totally kills all the joy.

legendofbrando about 1 year ago
The finding on simpler prompts tracks, especially with GPT-4 (3.5 requires the opposite).

The take on RAG feels application-specific. For our use case, where details of the past are surfaced, the ability to generate loose connections is actually a feature. Things like this are what excite me most about LLMs: having a way to proxy the subjective similarities we draw on when we remember things is a benefit of the technology that didn't really exist before, and it opens up a new kind of product opportunity.

pamelafox about 1 year ago
I've also seen that GPTs struggle to admit when they don't know. I wrote up an approach for evaluating that here: http://blog.pamelafox.org/2024/03/evaluating-rag-chat-apps-can-your-app.html?m=1

Changing the prompt didn't help, but moving to GPT-4 did help a bit.

Yacovlewis about 1 year ago
Interesting piece!

My experience around Langchain/RAG differs, so I wanted to dig deeper: putting some logic around handling relevant results helps us produce useful output. Curious what differs on their end.

pamelafox about 1 year ago
Lol, nice truncation logic! If anyone's looking for something slightly fancier, I made a micro-package for our tiktoken-based truncation here: https://github.com/pamelafox/llm-messages-token-helper

eigenvalue about 1 year ago
I agree with most of it, but definitely not the part about Claude 3 being “meh.” Claude 3 Opus is an amazing model and is extremely good at coding in Python. The ability to handle massive context has made it mostly replace GPT-4 for me day to day.

Sounds like everyone eventually concludes that Langchain is bloated and useless and creates way more problems than it solves. I don’t get the hype.

haolez about 1 year ago
That has been my experience too. The null hypothesis explains almost all of my hallucinations.

I just don't agree with the Claude assessment. In my experience, Claude 3 Opus is vastly superior to GPT-4. Maybe the author was comparing with Claude 2? (And I've never tested Gemini.)

WarOnPrivacy about 1 year ago
> We consistently found that not enumerating an exact list or instructions in the prompt produced better results

Not sure if he means training here or using his product; I think the latter.

My end-user experience of GPT-3.5 is that I need to be not just precise but the exact flavor of precise, usually arrived at after some trial and error. Then more error. Then more trial.

Getting a useful result on the 1st or 3rd try happens maybe 1 in 10 sessions. A bit more common is having 3.5 include what I clearly asked it not to. It often complies eventually.

disqard about 1 year ago
> This worked sometimes (I'd estimate >98% of the time), but failed enough that we had to dig deeper.

> While we were investigating, we noticed that another field, name, was consistently returning the full name of the state…the correct state – even though we hadn't explicitly asked it to do that.

> So we switched to a simple string search on the name to find the state, and it's been working beautifully ever since.

So, using ChatGPT helped uncover the correct schema, right?

KTibow about 1 year ago
I feel like for just extracting data into JSON, smaller LLMs could probably do fine, especially with constrained generation and training on extraction.

konstantinua00 about 1 year ago
> Have you tried Claude, Gemini, etc?

> It's the subtle things mostly, like intuiting intention.

This makes me wonder: what if the author "trained" himself on ChatGPT's "dialect"? How do we even detect that in ourselves?

And are we about to have "preferred LLM wars" like we had "programming language wars" for the last two decades?

aubanel about 1 year ago
> I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”

Why not really compare the two options, author? I would love to see the results!

ilaksh about 1 year ago
I recently had a bug where I was sometimes sending the literal text "null " right in front of the most important part of my prompt. This caused Claude 3 Sonnet to give the 'ignore' command in cases where it should have used one of the other JSON commands I gave it.

I have an ignore command so that it will wait when the user isn't finished speaking, which it generally judges well, unless it has 'null' in there.

The nice thing is that I have found most of the problems with the LLM's responses were just indications that I hadn't finished debugging my program, because I had something missing or weird in the prompt I gave it.

AtNightWeCode about 1 year ago
The UX is an important part of the trick that cons people into thinking these tools are better than they are. If you, for instance, instruct ChatGPT to only answer yes or no, it will feel like it is wrong much more often.

littlestymaar about 1 year ago
> One part of our pipeline reads some block of text and asks GPT to classify it as relating to one of the 50 US states, or the Federal government.

Using a multi-billion-parameter model like GPT-4 for such a trivial classification task[1] is insane overkill. And in an era where ChatGPT exists, and can in fact give you what you need to build a simpler classifier for the task, it shows how narrow-minded most people are when AI is involved.

[1] To clarify, it's either trivial or impossible to do reliably, depending on how fucked-up your input is.

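A sketch of the kind of simpler classifier alluded to here, assuming reasonably clean input; the state list is truncated for brevity:

    US_STATES = [
        "Alabama", "Alaska", "Arizona", "Arkansas", "California",
        # ... the remaining 45 states elided for brevity
    ]

    def classify_jurisdiction(text: str) -> str:
        """Return the first US state mentioned in the text, or 'Federal' as the fallback."""
        lowered = text.lower()
        for state in US_STATES:
            if state.lower() in lowered:
                return state
        return "Federal"
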
sungho_ about 1 year ago
I'm curious whether the OP has tried any of the libraries that constrain the output of LLMs (LMQL, Outlines, Guidance, ...), and for those who have: do you find them as unnecessary as LangChain? In particular, the OP's post mentions the problem of not being able to generate JSON with more than 15 items, which seems like a problem that could be solved by constraining the LLM's output. Is that correct?

peter_d_sherman about 1 year ago
> "Lesson 2: *You don’t need langchain*. You probably don’t even need anything else OpenAI has released in their API in the last year. *Just chat API. That’s it.*

> Langchain is the perfect example of premature abstraction. We started out thinking we had to use it because the internet said so. Instead, millions of tokens later, and probably 3-4 very diverse LLM features in production, and our openai_service file still has only one, 40-line function in it:

> def extract_json(prompt, variable_length_input, number_retries)

> The only API we use is chat. We always extract json. We don’t need JSON mode, or function calling, or assistants (though we do all that). Heck, we don’t even use system prompts (maybe we should…). When a gpt-4-turbo was released, we updated one string in the codebase.

> This is the beauty of a powerful generalized model – *less is more*."

Well said!

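The article quotes only the signature of extract_json; a hedged guess at what a single function in that spirit might look like (the model name, prompt framing, and brace-slicing logic are assumptions, not the author's code):

    import json

    from openai import OpenAI

    client = OpenAI()

    def extract_json(prompt: str, variable_length_input: str, number_retries: int = 3) -> dict:
        full_prompt = f"{prompt}\n\n{variable_length_input}\n\nRespond with JSON only."
        last_error = None
        for _ in range(number_retries):
            raw = client.chat.completions.create(
                model="gpt-4-turbo",  # the "one string" to update when a new model ships
                messages=[{"role": "user", "content": full_prompt}],
            ).choices[0].message.content
            try:
                # Tolerate prose around the JSON by slicing between the outermost braces.
                return json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
            except ValueError as err:
                last_error = err  # treat it like any other flaky API call and retry
        raise ValueError("Could not extract JSON after retries") from last_error
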
Kiro about 1 year ago
> We always extract json. We don't need JSON mode

Why? The null stuff would not be a problem if you did, and if you're only dealing with JSON anyway I don't see why you wouldn't.

mvkel about 1 year ago
I share a lot of this experience. My fix for "Lesson 4: GPT is really bad at producing the null hypothesis" is to have it return very specific text that I string-match on and treat as null.

Like: "if there is no warm up for this workout, use the following text in the description: NOPE"

Then in code I just do "if warm up contains NOPE, treat it as null".

albert_e about 1 year ago
> Are we going to achieve Gen AI?

> No. Not with this transformers + the data of the internet + $XB infrastructure approach.

Errr... did they really mean Gen AI, or AGI?

_pdp_ about 1 year ago
The biggest realisation for me while making ChatBotKit has been that UX > model alone. For me, the current state of AI is not about questions and answers. This is dumb. The presentation matters. This is why we are now investing in generative UI.

nprateem about 1 year ago
Anyone have any good tips for stopping it sounding like it's writing essay answers, and for flat-out banning "in the realm of", "delve", "pivotal", "multifaceted", etc.?

I don't want a crap intro or waffley summary, but it just can't help itself.

neals about 1 year ago
Do I need langchain if I want to analyze a large document of many pages?

satisfice about 1 year ago
I keep seeing this pattern in articles like this:

1. A recitation of terrible problems
2. A declaration of general satisfaction

Clearly and obviously, ChatGPT is an unreliable toy. The author seems pleased with it. As an engineer, I find that unacceptable.

amelius about 1 year ago
This reads a bit like: I have a circus monkey. If I do such and such it will not do anything. But when I do this and that, then it will ride the bicycle. Most of the time.

2099miles about 1 year ago
Great take, insightful. Highly recommend.

gok about 1 year ago
So these guys are just dumping confidential tax documents onto OpenAI's servers, huh.

orbatos about 1 year ago
Statements like this tell me your analysis is poisoned by misunderstandings: "Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking." No, there is no "higher-order thought" happening, or any at all actually. That's not how these models work.