Show HN: Parsing CSV files with GPU

123 points by antonmks, about 10 years ago

15 comments

andrewguenther, about 10 years ago
This title is incredibly misleading.

* This isn't parsing a CSV; this is a program written to split this exact dataset. (The code is filled with hard-coded values.)

* You're comparing a single-threaded run on a low-end CPU to a top-tier GPU.

* Your dataset can fit into GPU memory.

* There is a pull request for a missing semicolon, which means the posted version of the code won't even compile, so it couldn't have been the version used to generate the benchmarks.

* The amount of branching in the GPU code makes it hard for me to believe that it actually ran that fast. GPU parallelism does not work well with branching, since all cores in a warp must execute in lock-step; if you branch, you then have to go back and execute each of your branches separately.
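To make the lock-step point concrete, here is a minimal CUDA sketch (illustrative only, not code from the posted repo). In the divergent kernel, every warp contains both even and odd threads, so the hardware serializes the two branch paths; the uniform kernel has no such split.

    #include <cuda_runtime.h>

    // Divergent kernel: threads in the same warp take different branches,
    // so the warp executes both paths one after the other.
    __global__ void divergent(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)                // even/odd split inside every warp
            data[i] = data[i] * 2.0f;  // path A
        else
            data[i] = data[i] + 1.0f;  // path B
    }

    // Uniform kernel: every thread in a warp follows the same path,
    // so there is no serialization.
    __global__ void uniform(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        data[i] = data[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        divergent<<<(n + 255) / 256, 256>>>(d, n);
        uniform<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }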
castratikron, about 10 years ago
I'm skeptical of the 8x speedup for several reasons, the main one being that this particular problem does not fit the paradigm of problems that work well on the GPU: the GPU cache is not used at all, and there are also many branches. You need to be able to use the GPU's cache in your application; otherwise your performance is guaranteed to be memory-bound. The reason you want to avoid branches is that there is only one control unit per group of cores on the GPU, which means that if some threads follow one branch, the others have to stall until those threads complete. Generally, the only code that maps well to the GPU is code that contains large for loops and has good spatial locality (e.g. matrix multiplication).

The author is comparing a GPU to a CPU, yet the CPU is only running a single thread (supposedly; the author did not provide the CPU code used in the comparison). For a true comparison, the full capability of the CPU should be exercised by means of a multithreaded application (and, as someone else has already mentioned, vector instructions such as SSE). Think performance per socket, not performance per thread.
kazinator, about 10 years ago
Though there is no standard definition of CSV, de facto, processing it properly requires recognizing quotes, and also escapes of literal quotes using double quoting:

    this, "is, like, CSV", "with three so-called ""fields"""

Note that unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too.

(See the CSV page on Wikipedia.)

A GPU-accelerated string split could be useful, but it's not quite "parsing CSV".
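A quote-aware split has to carry state from one character to the next, which a flat delimiter scan doesn't. A minimal host-side sketch of the state machine (illustrative only; trimming of unquoted whitespace is omitted):

    #include <iostream>
    #include <string>
    #include <vector>

    // Split one CSV record, honoring quoted fields and "" escapes.
    // Embedded newlines inside quotes would need a record-level reader.
    std::vector<std::string> splitRecord(const std::string& line) {
        std::vector<std::string> fields;
        std::string cur;
        bool inQuotes = false;
        for (size_t i = 0; i < line.size(); ++i) {
            char c = line[i];
            if (inQuotes) {
                if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
                    cur += '"';        // doubled quote -> literal quote
                    ++i;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    cur += c;
                }
            } else if (c == '"') {
                inQuotes = true;       // opening quote
            } else if (c == ',') {
                fields.push_back(cur); // unquoted comma ends the field
                cur.clear();
            } else {
                cur += c;
            }
        }
        fields.push_back(cur);
        return fields;
    }

    int main() {
        std::string rec = "this, \"is, like, CSV\", \"with three so-called \"\"fields\"\"\"";
        for (const auto& f : splitRecord(rec))
            std::cout << '[' << f << "]\n";
    }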
paulmd, about 10 years ago
I kinda suspect he might be measuring the time it takes to launch a kernel rather than the time it takes the kernel to complete.

Thrust device calls, like those of the underlying CUDA library, are asynchronous by default. The only exception is calls that result in a memcpy, which are synchronous. To wait until an async call has completed, you need to call one of the synchronize commands, like cudaDeviceSynchronize.

Looking through his test.cu file, he snaps a timestamp using std::clock right after doing the kernel launch with for_each. Ignoring the fact that this is not an accurate way to benchmark a GPU (you need to use events to accurately benchmark the kernel), what you're capturing will just be the processor time it takes to make the async kernel launch. std::clock measures CPU time, which is (rightly) close to 0 for a program that runs on the GPU.

It's entirely possible that you're not even getting valid results out of the other end - note that you don't show output. I don't know whether Thrust's magic device memory access function triggers a synchronization or not. I seem to remember having to make an explicit call when I did a GPU simulation.

I don't have access to a CUDA box at the moment; I'd have to add those cudaDeviceSynchronize calls after the for_each invocations to be sure.
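For reference, event-based timing looks roughly like this (an illustrative sketch, not the repo's test.cu). Because the stop event is recorded in the same stream as the kernel and then synchronized on, the elapsed time covers the kernel's actual execution rather than just the asynchronous launch:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 24;
        float* d;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // block until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }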
roel_v, about 10 years ago
I don't understand the benchmark. How can a 750GB file be read from disk in 1.5 seconds, let alone parsed? He mentions it's a 2TB drive, so presumably it's not even an SSD?
victorNicollet, about 10 years ago
Very interesting. From my experience, the hard part about parsing CSV isn't identifying the individual cells, but rather parsing those cells afterwards (as numbers, dates, etc.).

What is the performance of those operations (e.g. parsing YYYY-MM-DD dates to Unix timestamps) when performed on the GPU?

My company actually picked another optimization strategy: making the tokenization step significantly longer, but de-duplicating the tokenized cells so that each distinct cell value (a date, a number, a string) is parsed exactly once. We have seen some fairly good results from this, compared to the naive stream-token-parse approach:

https://github.com/Lokad/lokad-flatfiles
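The de-duplication strategy in miniature (an illustrative host-side sketch of the idea, not Lokad's actual code; parseDate is a toy stand-in for whatever conversion is expensive):

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Toy converter: "YYYY-MM-DD" -> Unix timestamp (time zone glossed over).
    long parseDate(const std::string& s) {
        std::tm tm = {};
        tm.tm_year = std::atoi(s.c_str()) - 1900;
        tm.tm_mon  = std::atoi(s.c_str() + 5) - 1;
        tm.tm_mday = std::atoi(s.c_str() + 8);
        return static_cast<long>(std::mktime(&tm));
    }

    // Parse each *distinct* cell value once; duplicate cells hit the map.
    std::vector<long> parseColumn(const std::vector<std::string>& cells) {
        std::unordered_map<std::string, long> seen;
        std::vector<long> out;
        out.reserve(cells.size());
        for (const auto& cell : cells) {
            auto it = seen.find(cell);
            if (it == seen.end())
                it = seen.emplace(cell, parseDate(cell)).first;
            out.push_back(it->second);
        }
        return out;
    }

    int main() {
        std::vector<std::string> cells =
            {"2015-05-01", "2015-05-01", "2015-05-02"};  // one duplicate
        for (long t : parseColumn(cells))
            std::cout << t << '\n';
    }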
mholt, about 10 years ago
I gotta admit, having written a high-speed, multi-threaded streaming CSV parser for the browser [1], I'm quite impressed by your project. I've done CUDA programming before, and it's not easy (albeit libraries do help). Good work!

[1] http://papaparse.com
hannibalhorn, about 10 years ago
For file sizes where parsing speed really makes a difference, loading the entire file into memory doesn't seem feasible. Maybe some sort of hybrid approach (load chunks into memory and parse them via the GPU) would provide some real benefits, though.
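Such a hybrid loop might look like the following sketch (illustrative only; processChunk is a hypothetical device pass, and carrying a partial last line over to the next chunk is glossed over):

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical device-side pass over one chunk (e.g. find delimiters).
    __global__ void processChunk(const char* buf, size_t len) { /* ... */ }

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        FILE* f = fopen(argv[1], "rb");
        if (!f) return 1;

        const size_t CHUNK = 64 << 20;  // 64 MB per chunk
        std::vector<char> host(CHUNK);
        char* dev;
        cudaMalloc(&dev, CHUNK);

        size_t n;
        while ((n = fread(host.data(), 1, CHUNK, f)) > 0) {
            cudaMemcpy(dev, host.data(), n, cudaMemcpyHostToDevice);
            processChunk<<<(unsigned)((n + 255) / 256), 256>>>(dev, n);
            cudaDeviceSynchronize();
        }

        cudaFree(dev);
        fclose(f);
        return 0;
    }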
avarsheny, about 10 years ago
I don't think CPU vs. GPU is going to make much difference here. Your speedup in the GPU run could be because of a hot file system cache. Try running the GPU run first and the non-GPU run after that, and please post the results.
WhitneyLand, about 10 years ago
I like your creativity in applying the GPU to a less typical task.

Could the CPU version be sped up dramatically by using multiple cores and a variation on the line-splitting technique?
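Splitting the scan across cores is only a few lines with std::thread (an illustrative sketch; counting newlines stands in for the real per-line work, and slice boundaries aren't aligned to line breaks):

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // Each thread scans its own slice of the buffer; a real splitter
    // would also snap slice boundaries to the next '\n'.
    size_t countLines(const std::string& buf) {
        unsigned nthreads = std::thread::hardware_concurrency();
        if (nthreads == 0) nthreads = 4;
        std::vector<size_t> counts(nthreads, 0);
        std::vector<std::thread> workers;
        size_t slice = buf.size() / nthreads + 1;
        for (unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&, t] {
                size_t begin = t * slice;
                size_t end = std::min(buf.size(), begin + slice);
                for (size_t i = begin; i < end; ++i)
                    if (buf[i] == '\n') ++counts[t];
            });
        }
        for (auto& w : workers) w.join();
        size_t total = 0;
        for (size_t c : counts) total += c;
        return total;
    }

    int main() {
        std::string data = "a,b\nc,d\ne,f\n";
        std::cout << countLines(data) << " lines\n";
    }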
rdc12, about 10 years ago
"However this method of parsing is CPU bound because * it doesn't take advantage of multiple cores of modern CPUs. * memory bandwidth limitations"

So not CPU bound at all, then.
snissn, about 10 years ago
I think it would be really cool to integrate a GPU into a database to speed up certain operations, if an optimizer can decide that certain parts of a query will benefit from it. The Thrust docs [0] indicate that the C++ GPU library can be used effectively for sorting; I wonder if sorts on non-indexed fields can be sped up by attaching a GPU to my database!

[0] http://thrust.github.io/
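A minimal Thrust sort, for reference (an illustrative sketch):

    #include <cstdio>
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>

    int main() {
        // Fill a host vector, move it to the device, sort on the GPU.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = (int)((i * 2654435761u) % 1000000);  // pseudo-random fill

        thrust::device_vector<int> d = h;
        thrust::sort(d.begin(), d.end());

        // Pull back the five smallest values to verify the result.
        thrust::copy(d.begin(), d.begin() + 5, h.begin());
        for (int i = 0; i < 5; ++i)
            printf("%d ", (int)h[i]);
        printf("\n");
        return 0;
    }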
maxhou, about 10 years ago
In the real world, the slow part of "parsing" a CSV file is I/O: reading the file content from disk to memory, and from memory to CPU cache. Ideally you would avoid reading the file content more than once when parsing it.

> The first line counts the number of lines in a buffer (assuming that file is read into memory and copied to gpu buffer d_readbuff).

But that is what is done here: first a search to find all \n, then the multi-core GPU work on each line's content.
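The line count itself is a one-liner in Thrust (an illustrative sketch of the pattern, not the repo's exact code; d_readbuff just mirrors the buffer name quoted above):

    #include <cstdio>
    #include <cstring>
    #include <thrust/count.h>
    #include <thrust/device_vector.h>

    int main() {
        const char* text = "a,b,c\nd,e,f\ng,h,i\n";
        // Copy the buffer to the device, then count '\n' for the line count.
        thrust::device_vector<char> d_readbuff(text, text + strlen(text));
        long lines = (long)thrust::count(d_readbuff.begin(),
                                         d_readbuff.end(), '\n');
        printf("%ld lines\n", lines);
        return 0;
    }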
ape4, about 10 years ago
In the real world... read the CSV into a database (it doesn't really matter how fast or slow that is), then access the data from the database.
aftbit, about 10 years ago
Am I crazy? With a hot disk cache, the cut command he gave takes 1.2 seconds on my machine.