Pretty similar article from very recently: <a href="https://nullprogram.com/blog/2021/12/04/" rel="nofollow">https://nullprogram.com/blog/2021/12/04/</a><p>Discussion: <a href="https://news.ycombinator.com/item?id=29439403" rel="nofollow">https://news.ycombinator.com/item?id=29439403</a><p>The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes originally discovered by Geoff Langdale. That should work here for a nice speedup.<p>But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)
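<p>For anyone who hasn't seen the quote trick: the idea is to take the bitmask of quote positions and compute its prefix XOR, which gives a mask of the bytes that sit inside quoted fields. A carry-less multiply by an all-ones constant yields that prefix XOR in a single instruction. Below is a rough scalar sketch in Python, using shifts instead of CLMUL and 64-byte blocks. The helper names (`prefix_xor`, `byte_mask`, `unquoted_newlines`) are made up for illustration and are not taken from the article.

```python
MASK64 = (1 << 64) - 1

def prefix_xor(bits: int) -> int:
    """Bit i of the result is the XOR of input bits 0..i (64-bit wide).
    This is the value a carry-less multiply by all-ones produces in its low half."""
    bits &= MASK64
    shift = 1
    while shift < 64:
        bits ^= (bits << shift) & MASK64
        shift <<= 1
    return bits

def byte_mask(block: bytes, ch: int) -> int:
    """Bit i is set when block[i] == ch (stand-in for SIMD compare + movemask)."""
    return sum(1 << i for i, b in enumerate(block) if b == ch)

def unquoted_newlines(block: bytes, carry: int = 0):
    """Return (mask of newlines outside quotes, carry for the next block).
    carry is 1 if the block starts inside a quoted field, else 0."""
    assert len(block) <= 64
    inside = prefix_xor(byte_mask(block, ord('"')))
    if carry:
        inside ^= MASK64          # flip everything if we started inside quotes
    newlines = byte_mask(block, ord("\n"))
    next_carry = (inside >> 63) & 1   # state after the last byte of the block
    return newlines & ~inside & MASK64, next_carry

mask, carry = unquoted_newlines(b'a,"x\ny"\nb,c\n')
row_ends = [i for i in range(64) if mask >> i & 1]
```

Since there is no per-byte branching, the loop index can keep advancing by a full block and every load stays aligned.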
Stay tuned for a SIMD-powered CSV parser library and standalone utility about to drop this weekend. It's alpha, but our tests show it to be faster than anything else we could get our hands on.
Splitting a CSV file into chunks and processing them independently isn't necessarily wrong (although some implementations out there, which I won't name, do get it wrong because they guess). The trick, however, requires scanning the file twice: <a href="https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-of-the-nvme-storage/" rel="nofollow">https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...</a><p>Nice article otherwise!
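<p>A minimal sketch of the two-scan idea, in case it helps. This is one way to do it and not necessarily what the linked post does; the function names are hypothetical. The first scan counts quotes per chunk so that each chunk knows whether it starts inside a quoted field. The second scan can then handle the chunks independently (sequential here, but each iteration only depends on its own starting state).

```python
def chunk_starts_in_quote(data: bytes, chunk_size: int) -> list:
    """First scan: for each chunk, does it begin inside a quoted field?"""
    states = []
    inside = False
    for start in range(0, len(data), chunk_size):
        states.append(inside)
        # An odd number of quotes in the chunk flips the state for the next one.
        if data[start:start + chunk_size].count(b'"') % 2:
            inside = not inside
    return states

def find_row_ends(data: bytes, chunk_size: int) -> list:
    """Second scan: offsets of newlines that terminate a row, chunk by chunk."""
    ends = []
    for idx, inside in enumerate(chunk_starts_in_quote(data, chunk_size)):
        start = idx * chunk_size
        for off, b in enumerate(data[start:start + chunk_size], start):
            if b == 0x22:                 # double quote toggles the state
                inside = not inside
            elif b == 0x0A and not inside:
                ends.append(off)          # newline outside quotes ends a row
    return ends

sample = b'a,"x\ny"\nb,c\n'
ends_small_chunks = find_row_ends(sample, 4)
ends_one_chunk = find_row_ends(sample, 1000)
```

The result does not depend on where the chunk boundaries fall, which is the point: no guessing about whether a chunk starts mid-quote.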
Presumably solving the same kind of delimiter-finding issues as Hyperscan?
<a href="https://news.ycombinator.com/item?id=19270199" rel="nofollow">https://news.ycombinator.com/item?id=19270199</a>