Hey HN!<p>I've written a lot of RAG pipelines over the last year, and one consistent pain point is writing regexes to chunk documents correctly.<p>Right now, the most common chunking algorithms are:<p>- Split every 1,000 characters<p>- Split on whitespace<p>- Recursively split on a separator list: many newlines, then one newline, then periods, then spaces<p>The recursive character text splitter works best, but its regexes are brittle: when they fail to match, it falls back to producing huge chunks. Worse, this approach carries the overhead of maintaining a separate set of regexes for every file type.<p>We propose LlamaChunk, an inference-efficient method for LLM-powered chunking. It requires only a single LLM inference over your document to produce an optimal recursive character text split, without hoping that a pile of hard-coded rules happens to fit your unstructured data.<p>We're hoping to build a community for state-of-the-art RAG research @ <a href="https://discord.com/invite/yv2hQQytne" rel="nofollow">https://discord.com/invite/yv2hQQytne</a>, come join!
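<p>For anyone who hasn't seen it, the recursive splitting described above can be sketched in a few lines. This is a simplified illustration, not LlamaChunk's code (production splitters like LangChain's also merge small pieces back up toward the size limit); it shows the separator fallback, and the hard character split at the end is exactly the failure mode that produces awkward chunks:

```python
# Simplified sketch of recursive character text splitting.
# Try each separator in order; only fall back to the next,
# finer-grained one when a piece still exceeds the size limit.

SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraphs, lines, sentences, words

def recursive_split(text, max_len=1000, seps=SEPARATORS):
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separator left: hard-split on character count.
        # This is the brittle fallback the post complains about.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            # Piece is still too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```

The rule set is hard-coded per format, which is why it breaks on file types (code, tables, markup) whose structure these four separators don't capture.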