Leveraging SIMD: Splitting CSV Files at 3Gb/S

89 points by __exit__ over 3 years ago

7 comments

zwegner over 3 years ago
Pretty similar article from very recently: https://nullprogram.com/blog/2021/12/04/

Discussion: https://news.ycombinator.com/item?id=29439403

The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes, originally discovered by Geoff Langdale. That should work here for a nice speedup.

But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)
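
A minimal sketch of the mask-based idea above, not the article's code. It assumes Python integers standing in for the bitmasks a SIMD compare plus movemask would produce, a 64-byte chunk width, and the hypothetical helper names `bitmask`, `prefix_xor` and `unquoted_newlines`. The prefix XOR is done here with shifts; as far as I understand it, a carry-less multiply of the quote mask by an all-ones constant yields the same prefix XOR in one instruction, which is the CLMUL trick.

```python
MASK64 = (1 << 64) - 1  # assumed chunk width: 64 bytes, one mask bit per byte


def bitmask(chunk, ch):
    """Bit i is set when chunk[i] == ch (stand-in for SIMD compare + movemask)."""
    m = 0
    for i, b in enumerate(chunk):
        if b == ch:
            m |= 1 << i
    return m


def prefix_xor(x):
    """Bit i of the result is the XOR of bits 0..i of x, within 64 bits."""
    shift = 1
    while shift < 64:
        x ^= (x << shift) & MASK64
        shift <<= 1
    return x


def unquoted_newlines(chunk, in_quote=False):
    """Return (mask of newlines outside quotes, quote state after the chunk)."""
    assert len(chunk) <= 64
    quotes = bitmask(chunk, ord('"'))
    newlines = bitmask(chunk, ord("\n"))
    # After the prefix XOR, a bit is set where an odd number of quotes
    # has been seen so far, i.e. where we are inside a quoted field.
    inside = prefix_xor(quotes)
    if in_quote:
        inside ^= MASK64  # the chunk started inside a quoted field
    # The top bit carries the quote state over to the next chunk.
    next_in_quote = bool((inside >> 63) & 1)
    return newlines & ~inside & MASK64, next_in_quote
```

Because the quote state is carried between chunks as a single flag, every load can stay at a fixed, aligned offset, which is the point of the suggestion to keep i divisible by the vector width.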
jagrsw over 3 years ago
Not sure how the author of this entry on HN managed to change the original title from

gigabytes per second

to

gigabits per siemens

:)
mattewong over 3 years ago
Stay tuned for a SIMD-powered CSV parser library and standalone utility about to drop this weekend. It's alpha, but tests show it to be faster than anything else we could get our hands on.
liuliu over 3 years ago
Splitting a CSV file into chunks and processing them independently won't necessarily be wrong (although there are implementations out there, which I won't name, that would be, because they guess). The trick, however, requires scanning twice: https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-of-the-nvme-storage/

Nice article otherwise!
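
One way to read "scan twice", sketched under assumptions rather than taken from the linked post: a first pass counts quote characters per chunk so that the quote state at each chunk boundary is known, and a second pass then finds row-ending newlines in each chunk independently. The function names `chunk_start_states`, `row_ends_in_chunk` and `row_ends` are hypothetical, and the chunks are processed sequentially here for simplicity even though each pass could run in parallel per chunk.

```python
def chunk_start_states(data, chunk_size):
    """Pass 1: is each chunk's first byte inside a quoted field?"""
    states = []
    in_quote = False
    for start in range(0, len(data), chunk_size):
        states.append(in_quote)
        # An odd number of quotes in the chunk flips the state for the next one.
        if data.count(b'"', start, start + chunk_size) % 2:
            in_quote = not in_quote
    return states


def row_ends_in_chunk(data, start, end, in_quote):
    """Pass 2: offsets of newlines outside quotes in data[start:end]."""
    ends = []
    for i in range(start, end):
        b = data[i]
        if b == 0x22:  # double quote toggles the state
            in_quote = not in_quote
        elif b == 0x0A and not in_quote:  # newline that ends a row
            ends.append(i)
    return ends


def row_ends(data, chunk_size):
    """Combine both passes; no chunk has to guess its starting state."""
    states = chunk_start_states(data, chunk_size)
    ends = []
    for k, start in enumerate(range(0, len(data), chunk_size)):
        end = min(start + chunk_size, len(data))
        ends.extend(row_ends_in_chunk(data, start, end, states[k]))
    return ends
```

The result does not depend on where the chunk boundaries fall, which is what distinguishes this from implementations that guess the starting state.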
michaelg7x over 3 years ago
Presumably solving the same kind of delimiter-finding issues as Hyperscan? https://news.ycombinator.com/item?id=19270199
Tuna-Fish over 3 years ago
Why is the unit expression in the topic messed up?
rwmj over 3 years ago
Nice, but I'm afraid real-world CSVs are a lot more complicated than described, so don't use this code in production.