Show HN: I Used Llama-70B Logprobs for Better, Cheaper and Faster Chunking

1 point by ghita_ 6 months ago
The LlamaChunk algorithm is simple: we pick a special token that we know is not in the corpus, e.g. "段".

We pick the character "段" because 1. tiktoken always encodes it to exactly one token, 2. it is not in the corpus, and 3. it means "section" in Chinese.

Then we ask Llama to repeat the user's message with the "段" token sprinkled throughout. And just like that, it's perfect out of the box! It correctly uses the "段" character to mark the end of every section.

BUT if you've ever worked with LLMs, you know that input tokens are processed almost instantly, while output tokens take an eternity to generate. A naïve method is to simply wait for the LLM to repeat the entire Python code, inserting "段" throughout.

However, by inferencing Llama locally, we have a vastly more efficient way of doing this! We can simply pass in the entire paragraph and check the logprobs to see the probability that Llama wanted to output a "段" token at each location.

Done! The high logprob values clearly indicate the locations where we should chunk. And this is only possible because we have direct low-level access to Llama 3.1 70B.

Of course, there is a caveat. Because there are no output tokens, Llama never sees the section breaks it would have inserted. Thus, as the text gets longer, it loses the willpower to keep outputting "段".

But we can simply normalize by this decaying curve to fix it. And now we're ready to split any type of document, without having to resort to regex or manually created rules.

Processing 450,000 characters took about 15 minutes on an A100. However, ~80% of that time was spent saving and loading the KV cache (which can be done nearly instantly in C++ rather than Python). So we can expect about 3 minutes per 450,000 characters if done optimally, or ~7M tokens per hour.

In terms of accuracy, LlamaChunk has higher recall AND precision than both a naïve chunking method and semantic chunking (which uses embeddings to detect sentence-split boundaries, and still requires a good regex-based sentence splitter).

---

One thing you might wonder: what if the ideal chunk split is not along a token boundary?

For one thing, this is rare, as tokenizers intentionally split along meaningful boundaries. However, if the best split really is after the "f" in "fun", then you calculate:

```
lp = logprob(prefix="Fine Tuning is", next_token=" f")
if lp > -7:
    lp += logprob(prefix="Fine Tuning is f", next_token="段")
```

In other words, if the logprob of the token " f" is non-negligible (> -7), we add the logprob of "段" coming right after it (i.e., we multiply the underlying probabilities). The first line's prefix matches the main inference, so it does not need to be recalculated. However, if the "if" statement passes, we have to do an extra inference, costing us latency. In practice this almost never happens (in our measurements, about once every ~2,000 tokens), so it amortizes well.

Check out the repo for examples and more technical details.
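If you want to play with the core idea before digging into the repo, here is a minimal sketch in Python using Hugging Face transformers as a stand-in for our local inference stack. The model name, prompt wording, threshold, function names (`sentinel_logprobs`, `chunk_text`), and the linear decay correction are illustrative assumptions, not the exact code from the repo.

```
# Sketch of logprob-based chunking with a local causal LM (illustrative only).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # any causal LM you can read logits from
SENTINEL = "段"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
# The sentinel must encode to exactly one token for this to work.
sentinel_id = tok.encode(SENTINEL, add_special_tokens=False)[0]


def sentinel_logprobs(text: str):
    """For each token of `text`, return the logprob that the model would have
    emitted the sentinel immediately after it, under a teacher-forced prompt
    asking it to repeat the text with sentinels inserted."""
    prompt = (
        f"Repeat the following text exactly, inserting '{SENTINEL}' "
        f"at the end of every section.\n\n{text}\n\nRepeated text:\n"
    )
    prompt_ids = tok.encode(prompt)
    text_ids = tok.encode(text, add_special_tokens=False)
    input_ids = torch.tensor([prompt_ids + text_ids], device=model.device)

    with torch.no_grad():
        logits = model(input_ids).logits[0]          # (seq_len, vocab)
    logprobs = F.log_softmax(logits.float(), dim=-1)

    # Logits at position i predict token i+1, so this slice answers, for every
    # token of the forced copy of `text`: "how likely was 段 right after it?"
    start = len(prompt_ids)
    lp = logprobs[start : start + len(text_ids), sentinel_id]
    return text_ids, lp.cpu()


def chunk_text(text: str, threshold: float = -2.0):
    """Split `text` wherever the decay-corrected sentinel logprob is high."""
    text_ids, lp = sentinel_logprobs(text)

    # Crude stand-in for the decay normalization: remove a linear trend so late
    # positions are not penalized for the model's fading urge to emit 段.
    pos = torch.arange(len(lp), dtype=torch.float32)
    if len(lp) > 1:
        slope = ((pos - pos.mean()) * (lp - lp.mean())).sum() / ((pos - pos.mean()) ** 2).sum()
        corrected = lp - slope * (pos - pos.mean())
    else:
        corrected = lp

    chunks, current = [], []
    for tok_id, score in zip(text_ids, corrected.tolist()):
        current.append(tok_id)
        if score > threshold:                        # boundary right after this token
            chunks.append(tok.decode(current))
            current = []
    if current:
        chunks.append(tok.decode(current))
    return chunks
```

Usage would just be `chunk_text(open("doc.txt").read())`. As noted above, most of the wall-clock time in practice goes to KV-cache saving and loading rather than this forward pass, which is the part worth optimizing.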

no comments
