I/O is no longer the bottleneck

377 points by benhoyt, over 2 years ago

62 comments

Sirupsen, over 2 years ago
Yes, sequential I/O bandwidth is closing the gap to memory. [1] The I/O pattern to watch out for, and the biggest reason why e.g. databases do careful caching to memory, is that _random_ I/O is still dreadfully slow. I/O bandwidth is brilliant, but latency is still disappointing compared to memory. Not to mention, in typical cloud workloads, IOPS are far more expensive than memory.

[1]: https://github.com/sirupsen/napkin-math
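The sequential-versus-random gap described above is easy to probe yourself. This is a minimal sketch using only the standard library; note that on a small, freshly written file like this one the page cache hides most of the difference, so the dramatic gap only shows up on cold, multi-gigabyte files (and `os.pread` is Unix-only):

```python
import os
import random
import tempfile
import time

CHUNK = 4096
FILE_SIZE = 8 * 1024 * 1024  # small demo file; real gaps show on cold, multi-GB files

# Create a scratch file to read back.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(FILE_SIZE))
os.close(fd)

def sequential_read(path: str) -> int:
    # Read the file front to back in fixed-size chunks.
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

def random_read(path: str, n_reads: int) -> int:
    # Read the same number of chunks, but at random aligned offsets.
    total = 0
    offsets = [random.randrange(FILE_SIZE // CHUNK) * CHUNK for _ in range(n_reads)]
    fd = os.open(path, os.O_RDONLY)
    try:
        for off in offsets:
            total += len(os.pread(fd, CHUNK, off))
    finally:
        os.close(fd)
    return total

n_chunks = FILE_SIZE // CHUNK
t0 = time.perf_counter()
seq_bytes = sequential_read(path)
t1 = time.perf_counter()
rand_bytes = random_read(path, n_chunks)
t2 = time.perf_counter()
print(f"sequential: {seq_bytes} bytes in {t1 - t0:.4f}s")
print(f"random:     {rand_bytes} bytes in {t2 - t1:.4f}s")
os.remove(path)
```

Running this against a file larger than RAM (or after dropping the page cache) is what exposes the latency penalty the comment describes.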
vitiral, over 2 years ago
I question the methodology.

To measure this I would have N processes reading the file from disk with the max number of parallel heads (typically 16, I think). These would go straight into memory. It's possible you could do this with one process and the kernel would split the block read into 16 parallel reads as well; needs investigation.

Then I would use the rest of the compute for number crunching as fast as possible using as many available cores as possible: for this problem, I think that would basically boil down to a map-reduce. Possibly a lock-free concurrent hashmap could be competitive.

Now, run these in parallel and measure the real time from start to finish of both. Also get the total CPU time spent, for reference.

I'm pretty sure the author's results are polluted: while they are processing data, the kernel is caching the next block. Also, it's not really fair to compare single-threaded disk IO to a single process: one of the reasons for IO being a bottleneck is that it has concurrency constraints. Nevertheless, I would be interested in both the single-threaded and concurrent results.
Dunedan, over 2 years ago
> I haven't shown an optimized Python version because it's hard to optimize Python much further! (I got the time down from 8.4 to 7.5 seconds). It's as fast as it is because the core operations are happening in C code – that's why it so often doesn't matter that "Python is slow".

An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.

I believe that'd be necessary for a fair conclusion anyway, as you can't claim that I/O isn't the bottleneck without utilizing all of the available CPU and memory resources.
throwaway71271, over 2 years ago
EBS costs crazy amounts for reasonable IOPS.

We pay 7k per month for RDS that can barely do 2k IOPS; meanwhile, a machine at Hetzner does 2 million IOPS for 250 euros per month (not to mention it also has 4x more cores and 5x more RAM).

So, even though I/O is no longer the bottleneck physically, it is still a considerable issue and design challenge in the cloud.
SleepyMyroslav, over 2 years ago
The state of benchmarking by normal IT people is tragic. If one checks out his 'optimization problem statement' article [1], they can find:

> ASCII: it's okay to only support ASCII

> Threading: it should run in a single thread on a single machine

> Stdlib: only use the language's standard library functions.

This is truly 1978 all over again. No flame graphs, no hardware counters, no bottleneck analysis. Using these 'optimizations' for job interviews is questionable at best.

[1]: https://benhoyt.com/writings/count-words/
MrLeap, over 2 years ago
Interesting! This made me wonder: would this kind of optimization be recognized and rewarded in colossal-scale organizations?

I've seen comments about Google multiple times here where people say you won't be getting promotions unless you're shipping new things – maintaining the old won't do it.

But if you get to something core enough, it seems like the numbers would be pretty tangible and easy to point to during perf review time?

"Found a smoother way to sort numbers that reduced the 'whirrrrrr' noise our disks made. It turns out this reduces disk failure rates by 1%, arrested nanoscale structural damage to the buildings our servers are in, allowed a reduction in necessary PPE, elongated depreciation offsets, and other things – this one line of code has saved Google a billion dollars. That's why my compensation should be increased to include allowing me to fall limply into the arms of another and be carried, drooling, into the office, where others will dress me."

In this hypothetical scenario, would a Googler be told "Your request has been approved; it may take one or two payment periods before your new benefits break into your apartment" or "No, you need to ship another chat program before you're eligible for that"?
mastax, over 2 years ago
I was recently working on parsing 100K CSV files and inserting them into a database. The files have a non-column-oriented header and other idiosyncrasies, so they can't be directly imported easily. They're stored on an HDD, so my first instinct was to focus on I/O: read the whole file into memory as an async operation, so that there are fewer, larger IOs to help the HDD, and so that other parser tasks can do work while waiting for the read to complete. I used a pretty featureful C# CSV parsing library which did pretty well on benchmarks [0] (CsvHelper), so I wasn't really worried about that part.

But that intuition was completely wrong. The 100K CSV files only add up to about 2GB. Despite them being many small files, reading through them all is pretty fast the first time, even on Windows, and then they're in the cache and you can ripgrep through them all almost instantaneously. The pretty-fast parser library is fast because it uses runtime code generation for the specific object type being deserialized. The overhead of allocating a bunch of complex parser and type-converter objects, doing reflection on the parsed types, and generating code for a parser means that for parsing lots of tiny files it's really slow.

I had to stop worrying about it because 2 minutes is fast enough for a batch import process, but it still bothers me.

Edit: CsvHelper doesn't have APIs to reuse parser objects. I tested patching in a ConcurrentDictionary to cache the generated code, and it massively sped up the import. But again, it was fast enough and I couldn't let myself get nerd-sniped.

Edit 2: the import process would run in production on a server with low average load, 256GB RAM, and ZFS with zstd compression, so the CSV files will live permanently in the page cache and ZFS ARC. The import will probably run a few dozen times a day to catch changes. IO is really not going to be the problem. In fact, it would probably speed things up to switch to synchronous reads and remove all the async overhead. Oh well.

[0]: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers
samsquire, over 2 years ago
My immediate thought was: are you measuring throughput or latency?

The latency of reading from disk is indeed very slow compared to CPU instructions.

A 3GHz processor runs 3 billion (3,000,000,000) cycles a second, and some instructions take 1 cycle. You get 3 cycles per nanosecond. An SSD or spinning-disk access costs many multiples of cycles.

Read 1 MB sequentially from SSD: ~1,000,000 ns.

That's a lot of time that could be spent doing additions or looping.

https://gist.github.com/jboner/2841832
austin-cheney, over 2 years ago
I encountered this myself yesterday when attempting to performance-test WebSockets in JavaScript: https://github.com/prettydiff/share-file-systems/blob/master/documentation/websockets.md#challenges

The parsing challenge is complex enough that it will always be faster to extract the data from the network than it is to process it. As a result, excess data must be stored until it can be evaluated, or else it must be dropped; therefore the primary processing limitation is memory access, not CPU speed executing instructions. JavaScript is a garbage-collected language, so you are at the mercy of the language, and it doesn't really matter how you write the code: if the message input frequency is high enough and the messages are large enough, memory will always be the bottleneck, not the network or the application code.

In terms of numbers this is provable. When testing WebSocket performance on my old desktop with DDR3 memory, I was sending messages (without a queue or any kind of safety consideration) at about 180,000 messages per second. On my laptop with DDR4 memory, the same test indicated a message send speed of about 420,000 messages per second. The CPU in the old desktop is faster and more powerful than the CPU in the laptop.
fancyfredbot, over 2 years ago
NVMe storage really is very fast for sequential reads, but I'd respectfully suggest that for simple tasks, a Dell laptop with a 1.6GB/s read speed should be bottlenecked by IO if the compute is optimized. For example, simdjson can parse JSON at over 7GB/s. https://simdjson.org/
brianolson, over 2 years ago
SSD is pretty fast, but my app is actually trying to do more than 100,000 read-modify-write cycles per second, and that still requires careful thought about the database and schema we're using.

CPU and RAM are pretty fast. I do a live-coding interview question where I ask people to do a naive implementation first, then later I ask about possible optimizations. A third to a half of candidates want to do fewer RAM accesses, and oh boy is that the wrong avenue for this problem – especially when they just wrote their solution in Python and you could get a 10x-20x speedup by rewriting in C/C++/Go/Rust/etc.

Network is IO too. Network is pretty fast datacenter-to-datacenter, but end users can still have their experience improved with better encodings and protocols, and outbound bandwidth bills can be improved by that too.
chewbacha, over 2 years ago
Wouldn't memory allocation still be IO of a different resource? We're still being slowed down reading and writing bits to a storage device. Perhaps it's not the hard drive, but the claimed blocker here doesn't appear to be the CPU.
mrkeen, over 2 years ago
A few ballpark numbers I encountered:

Sequentially reading a file on a spinny laptop disk was about 80-100 MB/s. On an SSD that went up to 400-500 MB/s for me.

That's the sequential case! What about random access? I tried an experiment where I memory-mapped a large file and started updating bytes at random. I could get the rate down to kilobytes/sec.

Even though we've all heard that SSDs don't pay as much of a penalty for random access as spinny disks, it's still a huge penalty. Sequential spinny-disk access is faster than SSD random access.
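The random-byte-update experiment described above can be reproduced with the stdlib `mmap` module. This sketch uses a deliberately tiny 1 MiB file so it runs quickly; the kilobytes-per-second collapse only appears when the mapped file is much larger than RAM, so every touch faults in a cold page:

```python
import mmap
import os
import random
import tempfile

# Create a 1 MiB scratch file (the comment's experiment used a much larger one).
fd, path = tempfile.mkstemp()
size = 1024 * 1024
os.ftruncate(fd, size)  # sparse file of zero bytes

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), size) as mm:
        for _ in range(10_000):
            # Each write dirties whichever page the random offset lands on.
            mm[random.randrange(size)] = 0xFF
        mm.flush()  # force dirty pages out; on a large cold file this is the slow part

with open(path, "rb") as f:
    data = f.read()
# Some offsets repeat, so the number of 0xFF bytes is at most 10,000.
print(data.count(0xFF))
os.close(fd)
os.remove(path)
```

Timing the loop for a file-size sweep (1 MiB, 1 GiB, 10 GiB) is a quick way to watch the throughput fall off a cliff once the working set leaves the page cache.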
Dunedan, over 2 years ago
As the algorithm used in the example is straightforward, I figured that using UNIX command-line tools might be an even simpler way to implement it. Here is what I came up with:

    time cat kjvbible_x100.txt | tr "[:upper:] " "[:lower:]\n" | sort --buffer-size=50M | uniq -c | sort -hr > /dev/null

On my machine this turned out to be ~5 times slower than the provided Python implementation. Nearly all of the time is spent in the first invocation of `sort`. Further increasing the buffer size doesn't make a significant difference. I also played around with the number of threads `sort` uses, but didn't see any improvement there either.

I'm quite puzzled why `sort` is so much slower, especially as it does sorting in parallel utilizing multiple CPU cores, while the Python implementation is single-threaded.

Does somebody have an explanation for that?
nanis, over 2 years ago
Related posts from my past experience, from about 10 years ago:

* Splitting long lines is slow [1]

* Can Parallel::ForkManager speed up a seemingly IO-bound task? [2]

In both cases, Perl is the language used (with a little C thrown in for [1]), but they are in a similar vein to the topic of this post. In [1], I show that the slowness in processing large files line by line is not due to I/O, but due to the amount of work done by code. In [2], a seemingly I/O-bound task is sped up by throwing more CPU at it.

[1]: https://www.nu42.com/2013/02/splitting-long-lines-is-slow.html

[2]: https://www.nu42.com/2012/04/can-parallelforkmanager-speed-up.html
iAm25626, over 2 years ago
IO also means network to me. Often the target (a database, or a device generating telemetry) is 10+ms away. That round-trip time is bottlenecked by physics (the speed of light). A side benefit of SQLite is that it lives on the local file system/in memory.
prvt, over 2 years ago
"Sorting with O(n^2) is no longer a bottleneck, as we have fast processors" /s
stabbles, over 2 years ago
The title "I/O is _no longer_ the bottleneck" seems to suggest disk speed has caught up, while in reality the slowness is due to poor implementation (slow Python, or Go with lots of allocations).

The real problem to me is that languages are too high-level and hide temporary allocations too much. If you had to write this in C, you would naturally avoid unnecessary allocations, because alloc/free in the hot loop looks bad.

Presumably soon enough it's very unlikely you find any new word (actually it's 10 passes over the same text) and most keys exist in the hashmap, so it would be doing a lookup and incrementing a counter, which should not require allocations.

Edit: OK, I've run OP's optimized C version [1] and indeed, it only hits 270MB/s. So, OP's point remains valid. Perf tells me that 23% of all cache refs are misses, so I wonder if it can be optimized to group the counters of common words together.

[1]: https://benhoyt.com/writings/count-words/
mastazi, over 2 years ago
Networking is I/O: API calls, database access, etc. – it's not just disk access. The article derives a generalised statement from a very specific use case.
notacoward, over 2 years ago
Interesting observation, but I think the author crosses a bridge too far here.

> If you're processing "big data", disk I/O probably isn't the bottleneck.

If it fits on a single machine, it is *by definition* not big data. When you're dealing with really big data, it's likely coming from another machine, or more likely a cluster of them. Networks can also be pretty fast, but there will still be some delay associated with that, plus the I/O (which might well be on spinning rust instead of flash) on the other end. Big data requires parallelism to cover those latencies. *Requires.* It might be true that I/O is no longer likely to be the bottleneck for a single-threaded program, but leave "big data" out of it, because in that world I/O really is still a – if not the – major limiter.
avinassh, over 2 years ago
I am working on a project [0] to generate 1 billion rows in SQLite in under a minute, and I got 100M row inserts in 33 seconds. First, I generate the rows and insert them into an in-memory database, then flush them to the disk at the end. Flushing to disk takes only 2 seconds, so 99% of the time is spent generating and adding rows to the in-memory B-tree.

For Python optimisation, have you tried PyPy? I ran the same code (zero changes) under PyPy and got 3.5x better speed.

I published my findings here [1].

[0] - https://github.com/avinassh/fast-sqlite3-inserts

[1] - https://avi.im/blag/2021/fast-sqlite-inserts/
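The build-in-memory-then-flush pattern described above can be sketched with Python's `sqlite3` module, using `Connection.backup` (available since Python 3.7) as the one-pass flush. The table schema here is invented for illustration; the linked project's actual schema and generation code differ:

```python
import os
import sqlite3
import tempfile

# Build the table entirely in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, area_code TEXT, age INTEGER)")
mem.executemany(
    "INSERT INTO user (area_code, age) VALUES (?, ?)",
    (("020", i % 80) for i in range(100_000)),
)
mem.commit()

# Flush the finished database to disk in one pass.
path = os.path.join(tempfile.mkdtemp(), "users.db")
disk = sqlite3.connect(path)
with disk:
    mem.backup(disk)  # copies every page of the in-memory DB to the file
disk.close()
mem.close()

# Verify the on-disk copy is complete.
check = sqlite3.connect(path)
print(check.execute("SELECT COUNT(*) FROM user").fetchone()[0])  # 100000
check.close()
```

Because the flush is a single sequential page copy rather than per-row transactional writes, the disk step stays cheap even when row generation dominates.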
namkt, over 2 years ago
I would have thought that allocations in managed languages like Go/Python would be the "fast" part of the processing. Isn't it technically the GC that's slowing you down, not the allocation per se? For one-shot input/output programs like these, I guess you could tune the GC to kick in less frequently.

You also note that reading a file sequentially from disk is very fast, which it is, but there is no guarantee that the file's contents are actually sequential on disk (fragmentation), right? We'd have to see how the file was written, and I guess at worst you'd be reading sequential chunks of a hand-wavy 4KB or something, depending on the file system and what not. I'm sure others can fill in the details.

Just nit-picking here.
hansvm, over 2 years ago
I/O is still often the bottleneck. My laptop can handle 11 GB/s through RAM (and no NVMe, so under 1 GB/s through the hard drive), less with unpredictable I/O patterns (like a hash map), and 7600 GB/s through the CPU. Unless the thing you're doing is particularly expensive per byte of data, you're going to be limited at a minimum by RAM I/O, and maybe by disk I/O.

FWIW, all my recent performance wins have come either from reducing RAM I/O or from restructuring work to reduce contention in the memory controller, even at the cost of adding significantly more work for the CPU.
nimish, over 2 years ago
If you ignore latency, sure. Optane is still the fastest storage by quite a bit. Flash has yet to catch up and might never do so.

Tons of files and random writes can bring even an enterprise flash SSD to its knees, but Optane keeps on trucking.
kazinator, over 2 years ago
Processing *cache-hot* data has never been the bottleneck.

Running some search on a file on your 486-with-8-megs-of-RAM running Linux, where the file was in the operating system's cache, was dependent on the performance of the program and the overhead of reading data from the cache through syscalls.

You can't handwave away the performance of the program with the argument that it will hide behind I/O, even if that is true for a cache-cold run, because cache-hot performance is important. People run multiple different kinds of processing passes on the same data.
edejong, over 2 years ago
"Optimised" code, according to the author. What can we do to optimise further?

- Read the file in one thread pool, streaming the chunks to...

- ...another thread pool, which tokenises, counts, and sorts the chunks and sends them to...

- ...a merge step in another thread pool (basically map-reduce).

- Please stop malloc'ing for each token.

- Prealloc the map for found tokens (better to just allocate room for 200k words).

- SIMD would optimise your inner loop quite a lot. However, there are optimised libraries for this, so you don't have to write it yourself.

- `word = append(word, c)` <= this is very slow.

Why is there no profiling? Why don't you check how the compiler interpreted your code and benchmark the subparts?

In addition, there are at least some errors in your optimised program:

- You can't lower-case by subtracting like you do. Non-ASCII characters would fail.

- You also can't tokenise by comparing with c <= ' '. There are many characters that would break a string. See this exercise: https://campus.datacamp.com/courses/introduction-to-natural-language-processing-in-python/regular-expressions-word-tokenization?ex=10
tonetheman, over 2 years ago
I/O is no longer the bottleneck for silly interview questions, for the most part. But for real programs it can still be an issue.
lmeyerov, over 2 years ago
I like that the author started with measuring and thinking about bandwidth, which makes sense for streaming through a big file, so I'd have continued that way towards a different design and conclusion.

Continuing with standard Python (PyData) and OK hardware:

- 1 cheap SSD: 1-2 GB/s

- 8 cores (3 GHz) x 8 SIMD: 1-3 TFLOPS?

- 1 PCI card: 10+ GB/s

- 1 cheapo GPU: 1-3 TFLOPS?

($$$: cross-fancy-multi-GPU bandwidth: 1 TB/s)

For streaming like word count, the floating-point-operations (a proxy for actual ops) to read ratio is unclear, and the above supports 1000:1. Where the author is reaching the roofline on either is a fun detour, so I'll switch to what I'd expect of PyData Python.

It's fun to do something like run regexes on logs using cudf one-liners (the GPU port of pandas) and figure out the bottleneck. 1 GB/s sounds low; I'd expect the compute to be more like 20GB+/s for in-memory, so they'd need to chain 10+ SSDs to achieve that, and there's a good chance the PCI card would still be fine. At 2-5x more compute, the PCI card would probably become the new bottleneck.
tuyguntn, over 2 years ago
I haven't gone through the code, but the measurement methodology seems wrong to me.

> As you can see, the disk I/O in the simple Go version takes only 14% of the running time. In the optimized version, we've sped up both reading and processing, and the disk I/O takes only 7% of the total.

1. If I/O wasn't the bottleneck, shouldn't we optimize only reading to have comparable benchmarks?

2. Imagine the program was running for 100 seconds (14% I/O), so 14 seconds are spent on I/O. Now we optimize processing and the total time becomes 70 seconds. If I/O wasn't the bottleneck and we haven't optimized I/O, total disk I/O should become 20% of total execution time, not 7%.

Disk I/O:

> Go simple (0.499), Go optimized (0.154)

Clearly, I/O access was optimized 3x while total execution was optimized 1.6x. This is not a good way of measuring whether I/O is a bottleneck.

I agree, though, that things are getting faster.
Un1corn, over 2 years ago
Does someone have a comparison between common server SSDs and consumer SSDs? I wonder whether the speeds are comparable.
mips_avatar, over 2 years ago
I/O is sometimes the bottleneck. On Windows, anything involving lots of small file operations bottlenecks on NTFS and Defender. This makes some applications like git, which run beautifully on Linux, need countermeasures to operate well on Windows.
vinay_ys, over 2 years ago
Storage and compute separation is key to scaling data workloads. Here, scaling could be w.r.t. the volume/shape of data, the number of concurrent jobs on the same dataset, the complexity of each job, etc. In such an architecture, network access is unavoidable. And if you have multiple jobs competing for access to the same dataset concurrently, your sequential access can turn into semi-random access. You also have concerns about resource utilization while staying scalable under arbitrary bursty, contentious workloads. These are the things that make managing IO resources complex.
buybackoff, over 2 years ago
Having dealt with several data parsers, I would like to somehow estimate how much electricity is burned globally just on lazy implementations. E.g. in .NET, non-devirtualized `Stream.ReadByte` is often one of the hottest methods in a profiler. It and related methods can easily be responsible for a double-digit share of CPU when processing data. I mean, it's not IO, just pure overhead that disappears with custom buffering, where reading a single byte is as cheap as it should be.
up2isomorphism, over 2 years ago
The correct way to describe his experiment would be:

Of course I/O is the slowest part, but it is fast enough that most programmers are not able to fully utilize it.
nkozyra, over 2 years ago
> Some candidates say that sorting is going to be the bottleneck, because it's O(N log N) rather than the input processing, which is O(N). However, it's easy to forget we're dealing with two different N's: the total number of words in the file, and the number of unique words.

I don't see how that changes anything. There's a reason we use Big O rather than other notations. Their answer would still be correct.
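The article's two-different-N's point can be made concrete. In this sketch, the O(N) counting pass touches every word in the stream, while the O(N log N) sort only touches the far smaller set of unique words:

```python
from collections import Counter

# 10,000 copies of a 9-word sentence: total words is large, unique words is tiny.
words = ("the quick brown fox jumps over the lazy dog " * 10_000).split()

counts = Counter(words)        # O(total words) to build
ranked = counts.most_common()  # O(unique * log unique) to sort

print(len(words))   # 90000 <- the N in the O(N) input-processing pass
print(len(counts))  # 8     <- the much smaller N in the O(N log N) sort
```

With a ratio like 90,000 to 8, the N log N term is negligible in practice, which is why asymptotic notation alone can mislead when the two N's are conflated.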
slotrans, over 2 years ago
The article itself is fine, but the "conclusion" is loony. You can't draw conclusions about big data from toy experiments with small data.
sriku, over 2 years ago
On a related note, John Ousterhout (in the RAMCloud project) was basically betting that the latency of accessing RAM on another computer on a fast local network would eventually become competitive with local RAM access.

https://ramcloud.atlassian.net/wiki/spaces/RAM/overview
alecco, over 2 years ago
Nonsense. Latency matters. NVMe latency is measured in microseconds, while DRAM latency is measured in nanoseconds.

Sequential processing is not that common.
btbuildem, over 2 years ago
As often happens with these types of interview questions, there's a lot of context missing. What is the goal? Do we want readable code? Do we want fast code? Are we constrained somehow? It seems here the author doesn't really know, but kudos to them for examining their assumptions.

As a side note, a trie would be a neat data structure to use in a solution to this toy problem.
joshspankit, over 2 years ago
In the last few years I've been quietly moving database "rows" from databases to the disk.

Back in the day, accessing data from MySQL was actually *slower* than current SSD speeds. And now you get all sorts of benefits for free: hard-link deduplication, versioning, live backup, easy usage of GNU tools...

I don't discuss this with certain types of colleagues, but the results are excellent.
inglor_cz, over 2 years ago
Hmm. I just fought with a MariaDB installation that, when set to write immediately to disk, became rather slow. 7-8 INSERTs into the DB could easily take 3 seconds; unfortunately, the internal logic of the system didn't really lend itself to one INSERT of multiple rows.

Once I reconfigured innodb_flush_log_at_trx_commit to 2, the UI became lightning fast.
dehrmann, over 2 years ago
I'm not surprised. I've seen bulk MySQL reads in Python be CPU-bound. The interesting follow-up was that parallelizing reads in subprocesses wasn't going to help much, because I'd get killed by CPU again when serializing/deserializing between processes. I was capped at a ~2x speedup.
Klinky, over 2 years ago
The Samsung PM9A1 is the OEM version of the 980 Pro, a top-tier PCIe 4.0 NVMe SSD. What about an older SATA SSD (one without a DRAM buffer or HMB), or a 5400RPM hard drive? Also, as others have pointed out, random I/O will tank performance, especially simultaneous read/write operations to the same media.
andrew-ld, over 2 years ago
Honestly, I find this article too vague. If you process large amounts of data, you rarely do so with orderly reads and writes; even databases optimized for fast disks (see RocksDB) have disks as a bottleneck, even with the most recently developed hardware.
ThinkBeat, over 2 years ago
Is there any hardware accelerator / co-processor for the PC that will read a file into RAM autonomously, mainframe-style, and notify the OS when the file is fully loaded into memory (bypassing the CPU entirely)?

That would leave the CPU free to do other things during that time.
kgeist, over 2 years ago
The most common performance problems I've encountered in our projects are: 1) lack of indexes, resulting in extensive table scans; 2) I/O calls in a loop without batching.

I don't know if those count as "I/O bottlenecks" or not.
svnpenn, over 2 years ago
> converting to lowercase

With regard to accuracy, uppercase is the better option:

https://stackoverflow.com/a/65433126
323, over 2 years ago
Further evidence is the fact that optimized SIMD JSON and UTF-8 libraries exist. If I/O were the bottleneck, there wouldn't be a need to parse JSON using SIMD.
osigurdson, over 2 years ago
Compared to an L1 cache reference, it certainly still is.
williamcotton, over 2 years ago
Wouldn't it be nice if we could specify the allocators in GC languages? Like, why not expose a bump-allocator arena to Python with a manual release?
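CPython doesn't let you swap in a bump allocator from Python code, but the closest stdlib knob to a "manual release" is pausing the cycle collector around an allocation-heavy batch and collecting once at the end. A sketch (this controls only the cycle detector; reference counting still frees most objects immediately):

```python
import gc

def build_graph(n: int) -> list:
    # Allocation-heavy batch work: build many small container objects.
    return [{"id": i, "neighbors": []} for i in range(n)]

gc.disable()  # pause cycle collection for the duration of the batch
try:
    data = build_graph(200_000)
finally:
    gc.enable()
    gc.collect()  # one "manual release"-style sweep at the end

print(len(data))  # 200000
```

For finer control, `gc.freeze()` (Python 3.7+) moves surviving objects out of the collector's purview entirely, which is as close as the stdlib gets to an arena.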
mikesabbagh, over 2 years ago
Usually we mainly run jobs on NFS or similar disks, where I/O time would be more significant. It would be nice to run those tests on AWS.
mrich, over 2 years ago
I have a hunch that rewriting the program in C/C++ or Rust would change this significantly.
christophilus, over 2 years ago
Does anyone have a good resource for reasoning about how to avoid allocations in JavaScript?
daviddever23box, over 2 years ago
The problem here, as with most interview problems, is that it is wholly dissociated from its context; memory constraints, disk I/O, and file size are non-trivial considerations. Shame on people for stare-at-one's-shoes thinking.
Mave83, over 2 years ago
If you have to handle multi-million I/Os or files, or GB/s to TB/s of bandwidth, just use DAOS.io storage. Awesomely fast, scale-out, and self-healing.
lotik, over 2 years ago
Can we just give it more resources? /s
therealbilly, over 2 years ago
Does any of this really matter?
furstenheim, over 2 years ago
It would be nice to see this benchmark in Node.js with streams.
visarga, over 2 years ago
Loading 1GB JSON files is still slow.
ihatepython, over 2 years ago
It's true, I/O is no longer the bottleneck.

The bottleneck now is interviewers who think they have knowledge and expertise but do not. Their authoritative position ends up distorting everything, and then said person blogs about it and causes even more problems.
chaxor, over 2 years ago
It's still very bizarre to me to see people completely write off spinning-platter drives as 'old tech'. They're still used everywhere! (At least in my world.)

We have all of my team's data on an array of 18TB drives in a 100TB RAID 10 setup, a NAS at home doing the same, etc. Even some of our OS drives at work are 7200RPM drives – and we're a computational lab. Why is everyone so intent on pretending these effectively no longer exist? The cost of a decent amount of space with NVMe drives is just far too astronomical. We're not all millionaires.
Havoc, over 2 years ago
Picking something counterintuitive as an interview question is not a great idea. It defeats the purpose – it's harder to tell whether the candidate is going with conventional wisdom because that's what they think the answer is, or because the candidate thinks that's the answer the interviewer expects.

i.e. you could get sharp candidates who know the correct technical answer but intentionally give the wrong one because they rightly concluded that, statistically, the odds are the interviewer is on the conventional-wisdom wavelength.

Could be good if you're specifically looking for someone good at mind games, I guess.