Dummy's Guide to Modern LLM Sampling

228 points by nkko 11 days ago

12 comments

Der_Einzige 11 days ago
Related to this, our min_p paper was ranked #18 out of 12,000 submissions at ICLR and got an oral:

https://iclr.cc/virtual/2025/oral/31888

Our poster was popular:

poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.png?t=1745327604.27015

oral presentation (watch me roast Yoshua Bengio on this topic and then have him be the first questioner; I'm the second speaker, starting around the 19:30 mark. My slides for the presentation are there too, and they're really funny): https://iclr.cc/virtual/2025/session/31936

paper: https://arxiv.org/abs/2407.01082

As one of the min_p authors, I can confirm that top-n-sigma is currently the best general-purpose sampler by far. Also, temperature can and should be scaled far higher than it is today. Temps of 100 are totally fine with techniques like min_p and top-n-sigma.

Also, the special case of top_k = 2 with ultra-high temperature (one thing the authors recommend against near the end) is very interesting in its own right. Doing it leads to spelling errors roughly every tenth word, but it also seems to have a certain creativity that's quite interesting.
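To make the mechanics concrete, here is a minimal NumPy sketch of min-p sampling. It assumes one particular ordering (filter on the base distribution first, then apply temperature to the survivors), which is what makes the very high temperatures mentioned above safe; real inference engines let you order the steps differently, and this is an illustration rather than the paper's reference implementation:

```python
import numpy as np

def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
    """Illustrative min-p sampler: keep only tokens whose base probability
    is at least `min_p` times that of the most likely token, then apply
    temperature to the survivors and sample."""
    rng = rng or np.random.default_rng()
    # Base (temperature-1) probabilities, computed stably.
    base = np.exp(logits - logits.max())
    base /= base.sum()
    # Dynamic cutoff: scales with how confident the model is at this step.
    keep = base >= min_p * base.max()
    # Apply temperature only to the surviving tokens; the rest get -inf.
    scaled = np.where(keep, logits / temperature, -np.inf)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

Because the cutoff is relative to the top token's probability, the filter keeps many candidates when the model is uncertain and few when it is confident, regardless of how aggressively the survivors are then flattened by temperature.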
orbital-decay 11 days ago
One thing not said here is that samplers have no access to the model's internal state. It's basic math applied to the output distribution, which technically carries some semantics, but you can't decode it without being as smart as the model itself.

Certain samplers described here, like repetition penalty or DRY, are just like this: the model could repeat itself in a myriad of ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?

Hacking the autoregressive process has some low-hanging fruit, like min-p, that can make some improvement and enable certain nifty tricks, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
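The objection is easy to see in code: the sketch below is roughly everything a classic repetition penalty gets to work with. The multiplicative form follows the widely used CTRL-style convention; parameter values are illustrative. Nothing in it can recognize a paraphrase, a synonym, or a repeated idea expressed with different token IDs:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Sketch of a classic multiplicative repetition penalty (CTRL-style).
    It only sees exact token IDs already present in the context, so the
    model can still repeat itself freely via synonyms or rephrasing."""
    logits = logits.copy()
    for tok in set(generated_ids):
        # Positive logits are divided and negative ones multiplied, so the
        # penalized token becomes less likely in both cases.
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```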
smcleod 11 days ago
I had a go at writing a bit of a sampling guide for Ollama/llama.cpp as well recently; open to any feedback / corrections: https://smcleod.net/2025/04/comprehensive-guide-to-llm-sampling-parameters/
neuroelectron 11 days ago
Love this. The way everything is mapped out and explained simply really opens up the opportunity to try new things, and shows where you can do that effectively.

For instance, why not use whole words as tokens? Make a "robot" with a limited "robot dialect." Yes, there would be no capacity for new or rare words, but you could modify the training data and input data to translate those words into the existing vocabulary. Now you have a much smaller mapping that's literally robot-like, and it kind of gives the user an expectation of what kinds of questions the robot can answer well, like C-3PO.
mdp2021 11 days ago
When the attempt, though, is to have the LLM output an "idea," not just a "next token," selection over the logits vector should break that original idea... If the idea is complete, there should be no need for sampling over the logits at all.

The sampling, in this framework, should not happen near the output level ("what will the next spoken word be?").
michaelgiba 7 days ago
This is much more thorough, but here is an interactive post covering the related topic of constrained sampling that I put together a few weeks back: http://michaelgiba.com/grammar-based/index.html
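The core trick behind grammar-based constrained sampling is simple to sketch: at each step, mask out every token the grammar cannot accept, then sample from what remains. In the toy version below, `allowed_ids` is a hypothetical stand-in for the output of a real grammar engine (a parser that reports which tokens can legally extend the current partial output):

```python
import numpy as np

def constrained_sample(logits, allowed_ids, rng=None):
    """Toy constrained-decoding step: push every grammar-rejected token
    to -inf, renormalize, and sample from the legal remainder."""
    rng = rng or np.random.default_rng()
    mask = np.full(len(logits), -np.inf)
    mask[list(allowed_ids)] = 0.0  # legal tokens keep their logits
    masked = logits + mask
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

Since masking happens before the draw, the output is guaranteed to stay inside the grammar at every step, at the cost of sometimes forcing the model down low-probability paths.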
amelius 11 days ago
Would it be possible for the LLM to do the tokenization implicitly? So instead of building a separate tokenizer, you just allow any string of characters as input, and have a neural network that converts it into tokens, where the weights of that network are trained along with the rest of the LLM.
antonvs 11 days ago
This is great! “Sampling” covers much more than I expected.
ltbarcly3 11 days ago
Calling things "modern" when they are updates to techniques for technologies invented only a few years ago is borderline illiterate. Modern as opposed to what, classical LLM sampling?
simonw 11 days ago
This is a really useful document; the explanations are very clear, and it covers a lot of ground.

Does anyone know who wrote it? It's not credited, and it's published on a free Markdown pastebin.

The section on DRY ("repetition penalties") was interesting to me. I often want LLMs to deliberately output exact copies of their input. When summarizing a long conversation, for example, I tend to ask for exact quotes that are most illustrative of the points being made. These are easy to fact-check later by searching for them in the source material.

The DRY penalty seems to me like it would run counter to my goal there.
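The conflict is structural: DRY penalizes a candidate token in proportion to the length of the already-seen sequence it would extend, and a faithfully reproduced quote is exactly such a sequence, so the penalty grows as the quote gets longer. A simplified sketch (the `multiplier`, `base`, and `allowed_length` parameters mirror the commonly cited DRY formulation, but this is a paraphrase, not the reference implementation):

```python
import numpy as np

def dry_penalty(logits, context_ids, multiplier=0.8, base=1.75, allowed_length=2):
    """Simplified DRY ("don't repeat yourself") sketch: a candidate token
    that would extend a sequence already seen earlier in the context has its
    logit reduced exponentially in the length of that repeated sequence."""
    logits = logits.copy()
    n = len(context_ids)
    match_len = {}  # candidate token -> longest repeat it would extend
    for j in range(n):  # context_ids[j] is a candidate continuation
        m = 0
        # Count how many tokens just before position j match the tail
        # of the context (i.e., what was generated most recently).
        while m < j and context_ids[j - 1 - m] == context_ids[n - 1 - m]:
            m += 1
        match_len[context_ids[j]] = max(m, match_len.get(context_ids[j], 0))
    for tok, m in match_len.items():
        if m >= allowed_length:
            logits[tok] -= multiplier * base ** (m - allowed_length)
    return logits
```

Reproducing a 30-token quote verbatim means each next token arrives with a match length near 30, so the subtracted penalty quickly dwarfs the logits and the sampler is pushed off the quote.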
blt 11 days ago
This is pretty interesting. I didn't realize so much manipulation was happening after the initial softmax temperature choice.
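For anyone similarly surprised: a typical decoding stack really is a chain of transforms applied to the same logits before anything is drawn, with temperature just one early link. A schematic sketch (the particular chain and its ordering are illustrative; real inference engines make both configurable):

```python
import numpy as np

def temperature(logits, t=0.7):
    # Flatten (t > 1) or sharpen (t < 1) the distribution.
    return logits / t

def top_k(logits, k=40):
    # Keep only the k highest-scoring tokens; drop the rest.
    cutoff = np.sort(logits)[-min(k, len(logits))]
    return np.where(logits >= cutoff, logits, -np.inf)

def sample(logits, rng=None):
    # Softmax over whatever survived the chain, then draw one token.
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

def decode_step(logits):
    # The "manipulation after temperature": each stage rewrites the
    # logits before the final softmax-and-draw actually happens.
    return sample(top_k(temperature(logits)))
```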
gitroom 11 days ago
Man, there's always way more to this stuff than I first guess. Makes me wonder: do you think better sampling really fixes model limits, or is it just kind of patching over deeper problems?