It seems the site does not allow the in-page anchor:<p><a href="https://kaiokendev.github.io/til#extending-context-to-8k" rel="nofollow noreferrer">https://kaiokendev.github.io/til#extending-context-to-8k</a><p>I'm not sure what the rules are on linking Reddit, but here is the related thread [0].<p>llama.cpp has also started implementing it [1], and I'm sure any help would be appreciated.<p>Also, here is what I wrote in the thread for some further clarification:<p>> I suspected there was no way that I was the first to try something like this. After getting into contact with Ofir Press [2], he offered some kind words and pointed me in the direction of Vision Transformers. It turns out that conditional positional encoding does something very similar.<p><a href="https://arxiv.org/abs/2102.10882" rel="nofollow noreferrer">https://arxiv.org/abs/2102.10882</a><p>> We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to the input sequences that are longer than what the model has ever seen during training. Besides, CPE can keep the desired translation-invariance in the image classification task, resulting in improved performance.<p>While RoPE cannot simply be swapped out for CPE, the technique of stretching the sinusoidal encoding w.r.t. the input sequence length is closely related to this proven method for length extrapolation in Vision Transformers. With this, I think further gains could be had from varying the sequence length during finetuning, with the scaling factor made dependent on the sequence length (e.g. 0.5 when n = 4096, 0.64 when n = 3172).
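For concreteness, here is a minimal sketch of the interpolation trick in plain Python. This is not the llama.cpp implementation; the function names and the pure-Python pair rotation are mine. The idea is just that multiplying each position index by a scale factor (e.g. 0.5) before computing the RoPE angles maps a 4096-token sequence into the 0–2048 position range the model saw during training:

```python
import math

def rope_frequencies(dim, base=10000.0):
    # Standard RoPE inverse frequencies, one per pair of dimensions.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def rope_angles(position, dim, scale=1.0, base=10000.0):
    # scale=1.0 is vanilla RoPE; scale=0.5 compresses positions so that
    # position 4096 produces the same angles as position 2048 did in training.
    p = position * scale
    return [p * f for f in rope_frequencies(dim, base)]

def apply_rope(x, position, scale=1.0):
    # x: flat list of floats with even length; rotate consecutive pairs
    # (x[0], x[1]), (x[2], x[3]), ... by their per-pair angles.
    angles = rope_angles(position, len(x), scale)
    out = []
    for (x1, x2), a in zip(zip(x[0::2], x[1::2]), angles):
        c, s = math.cos(a), math.sin(a)
        out.extend([x1 * c - x2 * s, x1 * s + x2 * c])
    return out
```

With this sketch, `apply_rope(x, 4096, scale=0.5)` yields exactly the angles of `apply_rope(x, 2048)`, which is the whole trick: out-of-range positions are interpolated back into the trained range rather than extrapolated beyond it.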
This could teach the model to be sequence-length invariant at test time, and might be a viable method for strengthening the effect and reaching 16K context and beyond.<p>I am curious what other enhancements present in Transformer variants are waiting to be incorporated into local language models.<p>[0]: <a href="https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/a_simple_way_to_extending_context_to_8k/?utm_source=share&utm_medium=mweb" rel="nofollow noreferrer">https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/a_simpl...</a><p>[1]: <a href="https://github.com/ggerganov/llama.cpp/discussions/1965">https://github.com/ggerganov/llama.cpp/discussions/1965</a><p>[2]: <a href="https://twitter.com/OfirPress" rel="nofollow noreferrer">https://twitter.com/OfirPress</a>