Linux Pipes Are Slow

340 points by qsantos, 9 months ago

24 comments

One of my side projects is intended to address this: https://lwn.net/Articles/976836/

The idea is a syscall for getting a ringbuffer for any supported file descriptor, including pipes - and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer: zero copy IO, potentially without calling into the kernel at all.

Would love to find collaborators for this one :)
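To make the shared-ringbuffer idea concrete, here is a minimal single-producer/single-consumer byte ring in Python. It only illustrates the data structure: the class name `ByteRing`, its methods, and the in-process `bytearray` storage are assumptions of this sketch, not the interface of the proposed syscall, which would map one buffer into both processes.

```python
class ByteRing:
    """Single-producer/single-consumer byte ring buffer (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = bytearray(capacity)
        self._capacity = capacity
        # head and tail count bytes ever written/read, so they only grow.
        self._head = 0  # advanced by the producer
        self._tail = 0  # advanced by the consumer

    def used(self) -> int:
        return self._head - self._tail

    def free(self) -> int:
        return self._capacity - self.used()

    def write(self, data: bytes) -> int:
        """Copy as much of data as fits; return the number of bytes accepted."""
        n = min(len(data), self.free())
        pos = self._head % self._capacity
        first = min(n, self._capacity - pos)  # bytes before the wrap point
        self._buf[pos:pos + first] = data[:first]
        self._buf[0:n - first] = data[first:n]  # wrapped remainder, may be empty
        self._head += n
        return n

    def read(self, max_bytes: int) -> bytes:
        """Return up to max_bytes of buffered data, oldest first."""
        n = min(max_bytes, self.used())
        pos = self._tail % self._capacity
        first = min(n, self._capacity - pos)
        out = bytes(self._buf[pos:pos + first]) + bytes(self._buf[0:n - first])
        self._tail += n
        return out
```

In a real shared mapping the producer would only ever store to the head index and the consumer only to the tail index, which is what would let both sides make progress without a syscall as long as the ring is neither full nor empty.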

fatcunt, 9 months ago

> I do not know why the JMP is not just a RET, however.

This is caused by the CONFIG_RETHUNK option. In the disassembly from objdump you are seeing the result of RET being replaced with JMP __x86_return_thunk.

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include/asm/linkage.h#L23

https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/retpoline.S#L121

> The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed.

These are from the ASM_CLAC and ASM_STAC macros, which make space for the CLAC and STAC instructions (both of them three bytes in length, same as the number of NOPs) to be filled in at runtime if X86_FEATURE_SMAP is detected.

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include/asm/smap.h#L17-L26

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include/asm/cpufeatures.h#L264

https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/alternative.c#L265

0xbadcafebee, 9 months ago

Calling Linux pipes "slow" is like calling a Toyota Corolla "slow". It's fast enough for all but the most extreme use cases. Are you racing cars? In a sport where speed is more important than technique? Then get a faster car. Otherwise stick to the Corolla.

JoshTriplett, 9 months ago

This is a side note to the main point being made, but on modern CPUs, "rep movsb" is just as fast as the fastest vectorized version, because the CPU knows to accelerate it. The name of the kernel function "copy_user_enhanced_fast_string" hints at this: the CPU features are ERMS ("Enhanced Repeat Move String", which makes "rep movsb" faster for anything above a certain length threshold) and FSRM ("Fast Short Repeat Move String", which makes "rep movsb" faster for shorter moves too).
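One way to see whether a given machine advertises these two features is to look at the `flags` line of `/proc/cpuinfo`, where Linux on x86 reports them as `erms` and `fsrm`. The sketch below assumes that file layout; the helper names `cpu_flags` and `rep_movsb_support` are made up for this example.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags of the first CPU found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line (for example, a non-x86 system)


def rep_movsb_support(cpuinfo_text: str) -> dict[str, bool]:
    """Report whether the ERMS and FSRM feature flags are present."""
    flags = cpu_flags(cpuinfo_text)
    return {"erms": "erms" in flags, "fsrm": "fsrm" in flags}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print(rep_movsb_support(path.read_text()))
    else:
        print("no /proc/cpuinfo on this system")
```

This only reports what the CPU advertises; whether the kernel or glibc actually picks a `rep movsb` copy routine on that machine is a separate question.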

donaldihunter, 9 months ago

Something I didn't see mentioned in the article about AVX512, aside from the xsave/xrstor overhead, is that AVX512 is power hungry and causes CPU frequency scaling. See [1], [2] for details and as an example of how nuanced it can get.

[1] https://www.intel.com/content/dam/www/central-libraries/us/en/documents/cryptography-processing-with-3rd-gen-intel-xeon-scalable-processors-19-may-2021.pdf

[2] https://www.intel.com/content/www/us/en/developer/articles/technical/accelerating-x265-with-intel-advanced-vector-extensions-512-intel-avx-512.html

nitwit005, 9 months ago

Just about every form of IPC is "slow". You have decided to pay a performance cost for safety.

qsantos, 9 months ago

I am again getting the hug of death of Hacker News. The situation is better than the last time thanks to caching WordPress pages, but loading the page can still take a few seconds, so bear with me!

RevEng, 9 months ago

I didn't quite grasp why the original splice has to be so slow. They pointed out what made it slower than vmsplice - in particular allocating buffers and using scalar instructions - but why is this necessary? Why couldn't splice just be reimplemented as vmsplice? I'm sure there is a good reason, but I've missed it.

rwmj, 9 months ago

Be interesting to see a version using io_uring, which I think would let you pre-share buffers with the kernel avoiding some copies, and avoid syscall overhead (though the latter seems negligible here).

stabbles, 9 months ago

A bold claim for a blog that takes about 20 seconds to load.

Borg3, 9 months ago

Haha. When I read the title I smiled. Linux pipes slow? Moook.. Now try Cygwin pipes. That's what I call slow!

Anyway, nice article, it's good to know what's going on under the hood.

faizshah, 9 months ago

This is a really cool post and that is a massive amount of throughput.

In my experience in data engineering, it's very unlikely that your business logic can exceed 500 MB/s of throughput, as most libraries you're using are not optimized to that degree (SIMD etc.). That being said, I think it's a good technique to try out.

I'm trying to think of other applications this could be useful for. Maybe video workflows?
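For a rough sense of where a plain pipe sits relative to a figure like that, a measurement along these lines can be run locally. It is a minimal sketch, not the article's benchmark: the function name `pipe_throughput` and the 64 KiB chunk size are choices made for this example, and the result includes Python interpreter overhead, so it is not comparable to the numbers in the article.

```python
import os
import threading
import time


def pipe_throughput(total_bytes: int, chunk_size: int = 1 << 16) -> tuple[int, float]:
    """Push total_bytes through an os.pipe(); return (bytes_received, seconds)."""
    read_fd, write_fd = os.pipe()
    view = memoryview(bytes(chunk_size))  # zero-filled payload, sliced without copying

    def writer() -> None:
        remaining = total_bytes
        try:
            while remaining > 0:
                # os.write may accept fewer bytes than requested.
                remaining -= os.write(write_fd, view[:min(chunk_size, remaining)])
        finally:
            os.close(write_fd)  # lets the reader see end-of-file

    thread = threading.Thread(target=writer)
    start = time.perf_counter()
    thread.start()
    received = 0
    while True:
        data = os.read(read_fd, chunk_size)
        if not data:  # empty read means the write end was closed
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    thread.join()
    os.close(read_fd)
    return received, elapsed


if __name__ == "__main__":
    n, seconds = pipe_throughput(1 << 28)
    print(f"{n / seconds / 1e6:.0f} MB/s")
```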

sixthDot, 9 months ago

> I do not know why the JMP is not just a RET, however.

The jump seems generated by the expansion of the `ASM_CLAC` macro, which is supposed to change the EFLAGS register ([1], [2]). However in this case the expansion looks like it does nothing (maybe because of the target?). I'd be interested to know more about that. Call to the wild.

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/smap.h#L17

[2]: https://stackoverflow.com/a/60579385

yencabulator, 9 months ago

FUSE can be a bit trickier than a single queue of data chunks. Reads from /dev/fuse actually pick the right message to read based on priorities, and there are cases where the message queue is meddled with to e.g. cancel requests before they're even sent to userspace. If you naively switch it to eagerly putting messages into a userspace-visible ringbuffer, you might significantly change behavior in cases like interrupting slow operations. Imagine having to fulfill a ringbuf worth of requests to a misbehaving backend taking 5sec/op, just to see the cancellations at the very end.
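The difference can be illustrated with a toy queue in which a request can still be withdrawn as long as it has not been handed to the consumer. This is a hypothetical sketch, not how the FUSE kernel code is structured: the class `PendingRequests` and its methods are invented for the example, and priorities are left out.

```python
from collections import OrderedDict
from typing import Optional


class PendingRequests:
    """FIFO of requests that can be cancelled until they are delivered."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[int, str]" = OrderedDict()

    def submit(self, req_id: int, op: str) -> None:
        self._pending[req_id] = op

    def cancel(self, req_id: int) -> bool:
        """Drop a request that has not been delivered yet; True if it was dropped."""
        if req_id in self._pending:
            del self._pending[req_id]
            return True
        return False  # already delivered, so the consumer will still process it

    def next(self) -> Optional[tuple[int, str]]:
        """Deliver the oldest pending request, or None if there is none."""
        if not self._pending:
            return None
        return self._pending.popitem(last=False)
```

With an eagerly filled ring, every request would count as delivered the moment it is submitted, so `cancel` would always return False and the slow backend would still have to work through each entry.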

nyanpasu64, 9 months ago

How do you gather profiling information for kernel function calls from a user program?

jvanderbot, 9 months ago

> Although SSE2 is always available on x86-64, I also disabled the cpuid bit for SSE2 and SSE to see if it could nudge glibc into using scalar registers to copy data. I immediately got a kernel panic. Ah, well.

I think you need to recompile your compiler, or disable those explicitly via link / cc flags. It is fairly hard to coax compilers into, or dissuade them from, emitting SIMD instructions, IMHO.

arendtio, 9 months ago

I know pipes primarily from shell scripts. Are they being used in other contexts as extensively, too? Like C or Rust programs?

up2isomorphism, 9 months ago

Someone tasted a bread and thought it was not sweet enough, which is fine. But calling the bread bland is funny, because it is not meant to taste sweet.

jeremyscanvic, 9 months ago

Great post! I didn't know about vmsplice(2). I'm glad to see a former ENSL student here as well!

goodpoint, 9 months ago

Excellent article even if, to be honest, the title is clickbait.

mparnisari, 9 months ago

I get PR_CONNECT_RESET_ERROR when trying to open the page

cowsaymoo, 9 months ago

What is the library used to profile the program?

djaouen, 9 months ago

So is Python, but I'm still gonna use it lol

jheriko, 9 months ago

just never use pipes. they are some weird archaism that needs to die :P

the only time i've used them is because of external constraints. they are just not useful.
