TechEcho
LlamaChunk: Better RAG Chunking Than LlamaIndex

15 points by npip99, 7 months ago

1 comment

npip99, 7 months ago
Hey HN!

I've written a lot of RAG pipelines over the last year, and one consistent pain point is writing regexes to chunk documents correctly.

Right now, the most common chunking algorithms are:

- Split every 1000 characters
- Split on whitespace
- Recursively split on: many newlines, then one newline, then periods, then spaces

The best of these is the recursive character text splitter, but the regexes are super brittle, and when one fails to match it ends up creating huge chunks. Worse, this approach also has the overhead of maintaining regexes for every single filetype.

Here we propose LlamaChunk, an inference-efficient method of LLM-powered chunking. It requires only a single LLM inference over your document to produce an optimal recursive character text split, without hoping that a bunch of hard-coded rules happens to work on your unstructured data.

We're hoping to build a community for state-of-the-art RAG research @ https://discord.com/invite/yv2hQQytne , come join!
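For context, the recursive splitting baseline described above can be sketched roughly as follows. This is a generic illustration of the technique, not LlamaChunk's or LlamaIndex's actual code; the separator list and the 1000-character limit are assumptions for the example:

```python
# A minimal recursive character text splitter: try coarse separators first,
# and recurse with finer ones when a piece is still too long.
SEPARATORS = ["\n\n", "\n", ". ", " "]  # illustrative, not LlamaChunk's list

def recursive_split(text: str, max_len: int = 1000, seps=SEPARATORS) -> list[str]:
    """Split `text` into chunks of at most `max_len` characters,
    preferring the coarsest separator that yields small enough pieces."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separator left: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging pieces into the current chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # A single piece is still too long: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

The brittleness the post describes shows up in the fallback path: when none of the separators appear in a long span (e.g. minified code or tables), the splitter degrades to arbitrary fixed-width cuts, which is the "huge chunks / bad chunks" failure mode LlamaChunk aims to avoid.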
Comment #42148649 not loaded
Comment #42148722 not loaded