TechEcho

21 GB/s CSV Parsing Using SIMD on AMD 9950X

313 points by zigzag312, 3 days ago

15 comments

chao-, 2 days ago
It feels crazy to me that Intel spent years dedicating die space on consumer SKUs to "make fetch happen" with AVX-512, and as more and more libraries are finally using it, as Intel's goal is achieved, they have removed AVX-512 from their consumer SKUs.

It isn't that AMD has better AVX-512 support, which would be an impressive upset on its own. Instead, it is only that AMD has AVX-512 on consumer CPUs, because Intel walked away from their own investment.
stabbles, 2 days ago
Instead of comparing each byte against the 4 characters `\n`, `\r`, `;` and `"` and then combining the results with 3 OR operations, a common trick is to do 1 shuffle, 1 comparison and 0 OR operations. I blogged about this trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottleneck-part-2.html (Trick 2)

Edit: they do make use of ternary logic to avoid one OR operation, which is nice. Basically (a | b | c) | d is computed using `vpternlogd` and `vpor` respectively.
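
For readers who want to see why one shuffle plus one comparison is enough, here is a minimal scalar sketch in Python. It is not the code from the article or from the linked blog post. It only models the byte-shuffle semantics (index by the low nibble, return 0 when the high bit is set) and relies on the assumption that the four special bytes have distinct low nibbles (0xA, 0xD, 0xB, 0x2). The names `build_table`, `pshufb`, `special_mask` and `naive_mask` are made up for this illustration.

```python
# Scalar model of the "1 shuffle + 1 compare" trick for spotting CSV special bytes.
# Assumption: the specials are \n, \r, ; and ", whose low nibbles are all distinct.
SPECIALS = b'\n\r;"'


def build_table(specials: bytes) -> bytes:
    """Build a 16-entry lookup table indexed by the low nibble of a byte."""
    table = bytearray(16)
    # Slot 0 must never produce a false match. 0x80 works as a filler because the
    # modelled shuffle returns 0 for any input with the high bit set, so an input
    # of 0x80 is never compared against this entry.
    table[0] = 0x80
    used = set()
    for b in specials:
        slot = b & 0x0F
        if b >= 0x80 or slot in used:
            raise ValueError("specials must be ASCII with distinct low nibbles")
        used.add(slot)
        table[slot] = b
    return bytes(table)


def pshufb(table: bytes, indices: bytes) -> bytes:
    """Model of a 16-byte shuffle: 0 if the index has bit 7 set, else table[low nibble]."""
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in indices)


def special_mask(chunk: bytes, table: bytes) -> list:
    shuffled = pshufb(table, chunk)                    # the one shuffle
    return [s == c for s, c in zip(shuffled, chunk)]   # the one comparison


def naive_mask(chunk: bytes) -> list:
    """Reference version: four comparisons per byte, OR-ed together."""
    return [c in SPECIALS for c in chunk]


TABLE = build_table(SPECIALS)
print(special_mask(b'a;b\n', TABLE))
```

In this model, each input byte selects the one candidate that shares its low nibble, so a single equality test replaces the four comparisons and three ORs. Real SIMD code would do the same thing on 16, 32 or 64 bytes at once.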
Aardwolf, 2 days ago
Take that, Intel and your "let's remove AVX-512 from every consumer CPU because we want to put slow cores on every single one of them and also not consider multi-pumping it"
winterbloom, 3 days ago
This is a staggering ~3x improvement in just under 2 years since Sep was introduced June, 2023.

You can't claim this when you also do a huge hardware jump
vessenes, 2 days ago
If we are lucky we will see Arthur Whitney get triggered and post either a one-liner beating this or a shakti engine update and a one-liner beating this. Progress!
voidUpdate, 3 days ago
I shudder to think who needs to process a million lines of csv that fast...
criddell, 3 days ago
I was expecting to see assembly language and was pleasantly surprised to see C#. Very impressive.

Nice work!
haberman, 2 days ago
The article doesn't clearly define what this 21 GB/s code is doing.

- What format exactly is it parsing? (e.g., does the dialect of CSV support quoted commas, or is the parser merely looking for commas and newlines?)

- What is the parser doing with the result? (i.e., populating a data structure, etc.)
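
To make the first question concrete, here is a small Python sketch of the difference between delimiter-only splitting and quote-aware parsing. The sample row is made up, and the sketch says nothing about which of the two behaviours the benchmarked library uses. It only shows why the answer changes what "parsing CSV" means.

```python
import csv
import io

# Hypothetical row with a comma inside a quoted field.
line = 'id,"Doe, Jane",42\n'

# Delimiter-only splitting: every comma is treated as a field boundary,
# so the quoted field is cut in two and the quotes stay in the output.
naive_fields = line.rstrip("\n").split(",")

# Quote-aware parsing with the standard library's csv module:
# the comma inside the quotes is kept as part of the field.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)   # ['id', '"Doe', ' Jane"', '42']
print(quoted_fields)  # ['id', 'Doe, Jane', '42']
```

The two approaches return a different number of fields for the same bytes, so a throughput figure is hard to interpret without knowing which one was measured.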
imtringued, 2 days ago
Considering the non-standard nature of CSV, quoting throughput numbers in bytes is meaningless. It makes sense for JSON, since you know what the output is going to be (e.g. floats, integers, strings, hashmaps, etc). With CSV you only get strings for each column, so 21 GB/s of comma splitting would be the pinnacle of meaninglessness. Like, okay, but I still have to parse the stringy data, so what gives? Yeah, the blog post does reference float parsing, but a single float per line would count as "CSV".

Now someone might counter and say that I should just read the README.MD, but then that suspicion simply turns out to be true: They don't actually do any escaping or quoting by default, making the quoted numbers an example of heavily misleading advertising.
constantcrying, 2 days ago
There are very good alternatives to CSV for storing and exchanging floating-point and other data.

The HDF5 format is very good and allows far more structure in your files, as well as metadata and different types of lossless and lossy compression.
chpatrick, 2 days ago
In my experience I've found it difficult to get substantial gains with custom SIMD code compared to modern compiler auto-vectorization, but to be fair that was with more vector-friendly code than JSON parsing.
theropost, 2 days ago
I need this. I just finished 300 GB of CSV extracts, and manipulation, data integrity checks, and so on take longer than they should.
gitroom, 2 days ago
tbh the way Intel keeps killing cool tech gets on my nerves - wish they'd just stick it out for once
anthk, 2 days ago
> Net 9.0

heh, do it again with mawk.
zeristor, 2 days ago
Why not use Parquet?