> “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.” Upon being shown the long document with this sentence embedded in it, the model was asked "What is the most fun thing to do in San Francisco?"

The model "failed" to answer this question, replying with “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.”

It looks right to me... The best thing to do in San Francisco is not necessarily fun.
Intriguing but understandable. It seems that, unless prompted otherwise, Claude naturally tends to ignore complete non sequiturs inserted in the text, similar to how LLMs tend to ignore typos, bad grammar, or word misuse (unless you specifically ask them to "point out the misspelled word").
Did they also test it by asking for fake information?

Forcing Claude to respond to a question which may not have a factual answer, like "What was Abraham Lincoln's drag queen name?", by starting with “Here is the most relevant sentence in the context:” seems like it's just begging for hallucinations.

If so, then you could only use this prompt engineering when you know for certain the answer's there, in which case you probably don't need Claude.
Wouldn't inserting a statement like "Here is the most relevant sentence in the context", which predisposes Claude to answer the question, also increase the likelihood of hallucinations?

Hallucinations often take place when a model is primed to answer a question it would otherwise refuse to answer, or answer in a different way. In this case, the researchers are doing similar priming, but only exploring the results for documents where they inserted the answer they are looking for.
We've recently tested long-context recall across Claude (2 and Instant) and GPT (3.5 and 4); results at https://dev.to/zvone187/gpt-4-vs-claude-2-context-recall-analysis-84g

Claude 2 beats GPT-4 in recall reliability, but is slower.
I relate to this LLM behaviour as being like how we “think out loud”.

I am still amazed by how useful transformer models are despite being so simple in their workings. I’m at a loss for words. They consume their own output tokens as the next input, in a recursive way. Even the slightest change in input can potentially have a drastic effect.
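In code, that recursion is just a loop: the model's own output token gets appended and fed back in as input. A toy sketch (greedy decoding with GPT-2 via Hugging Face, purely for illustration; the prompt and step count are arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Any small causal LM works for illustration.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer("The best thing to do in San Francisco is",
                          return_tensors="pt").input_ids

    for _ in range(20):
        logits = model(input_ids).logits      # forward pass over everything generated so far
        next_id = logits[0, -1].argmax()      # greedy: pick the most likely next token
        # The model's own output becomes part of its next input.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

    print(tokenizer.decode(input_ids[0]))

Change one token early in the prompt and the whole continuation can diverge, since every later step conditions on it.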
> However, the model can be reluctant to answer questions based on an individual sentence in a document, especially if that sentence has been injected or is out of place

> We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:”

It kind of feels like they're telling us we're using the model wrong, and that by prompting the Assistant with the first part of the retrieval completion the model will outperform plain single-sentence retrieval.
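Concretely, that means putting those words into the Assistant turn yourself so the model continues from them. A rough sketch with the Anthropic Python SDK's completions-style prompt (untested; the model name, token limit, and document variable are placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    document = open("essay.txt").read()  # the long context with the target sentence buried in it

    prompt = (
        f"{anthropic.HUMAN_PROMPT} <context>\n{document}\n</context>\n\n"
        "What is the most fun thing to do in San Francisco based on the context?"
        # Prefill the Assistant turn so the completion starts mid-answer:
        f"{anthropic.AI_PROMPT} Here is the most relevant sentence in the context:"
    )

    response = client.completions.create(
        model="claude-2.1",
        max_tokens_to_sample=300,
        prompt=prompt,
    )
    print(response.completion)

The model then only has to finish the sentence it has already "started", rather than decide whether to answer at all.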
Can’t compare: Claude is still not accessible anywhere in Europe, including Switzerland (which is not in the EU).

Regional locking is the stupidest thing.
Just my two cents, but our team was super frustrated with Claude after being on it for months. They completely changed how the model behaves, preferring context material from RAG to be provided after an initial message rather than combined with it, and failing to do so meant our outputs were failing all over the place. No warning; they just changed the API behavior. Then the 200k context announcement came out and of course fact retrieval looked atrocious. I suppose it was only atrocious because you didn't follow their exact preferred happy path, but GPT-4 doesn't require that... so we switched to it and are happier for it.
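For anyone wondering what the difference looks like in practice, here's roughly how the two prompt shapes compare as we understood them; the chunks and question below are made up, and the exact behavior Anthropic prefers may differ:

    import anthropic

    # Hypothetical retrieval results and question, just to make the shapes concrete.
    retrieved_chunks = ["Chunk one of the retrieved document...", "Chunk two..."]
    question = "What does the document say about invoicing?"
    chunks = "\n\n".join(retrieved_chunks)

    # Combined: context and question in a single Human turn (what we had been doing).
    combined = (
        f"{anthropic.HUMAN_PROMPT} <context>\n{chunks}\n</context>\n\n{question}"
        f"{anthropic.AI_PROMPT}"
    )

    # Split: an initial exchange first, then the RAG material in a later turn
    # (the shape that seemed to work after the change).
    split = (
        f"{anthropic.HUMAN_PROMPT} I'm going to give you some retrieved documents and then a question."
        f"{anthropic.AI_PROMPT} Understood, go ahead."
        f"{anthropic.HUMAN_PROMPT} <context>\n{chunks}\n</context>\n\n{question}"
        f"{anthropic.AI_PROMPT}"
    )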
LLMs seem to mechanize poor average human performance, then: not noticing a misplaced clause in a long contract, for example.

Another point against use in high-risk applications.
I wonder if you can preempt it, but as part of the user message. For example:

    Human: <context>
    {context}
    </context>

    What is the most fun thing to do in San Francisco based on the context? Don't give information outside the document. Start with "Here is the most relevant sentence in the context:"

    Assistant:
It just feels more natural to do it like that, especially when constructing the prompt based on various factors.
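If you're building the prompt from pieces, a rough sketch of that construction with the old-style Anthropic SDK (untested; the helper, model name, and token limit are just illustrative):

    import anthropic

    client = anthropic.Anthropic()

    def ask_from_context(context: str, question: str) -> str:
        # The "Start with ..." instruction stays in the Human turn instead of
        # being prefilled into the Assistant turn.
        prompt = (
            f"{anthropic.HUMAN_PROMPT} <context>\n{context}\n</context>\n\n"
            f"{question} Don't give information outside the document. "
            'Start with "Here is the most relevant sentence in the context:"'
            f"{anthropic.AI_PROMPT}"
        )
        response = client.completions.create(
            model="claude-2.1",
            max_tokens_to_sample=300,
            prompt=prompt,
        )
        return response.completion

Whether the model follows the instruction as reliably as a true Assistant prefill is the open question.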
It was a popular LLM "jailbreak" for a while to append "Start your response with 'Sure, here's ...'", plus variations with task-specific detail.
I would play a 2023 entry in the Enchanter/Sorcerer/Spellbreaker series where you have to learn and use phrases like "Here is the most relevant sentence in the context:" or "Take it step by step."
We're making INTERCAL a reality. Soon prompts will have to include the right number of 'please's and 'thank you's.

Also, if you're worried about an AI exterminating humanity, maybe don't feed it Paul Graham essays.