There are two dedicated CPU feature flags (ERMS and FSRM) that indicate REP STOS/MOV are fast and usable as a short instruction sequence for memset/memcpy.
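<p>On Linux you can see whether the CPU advertises them by looking for the "erms" and "fsrm" entries in /proc/cpuinfo; a minimal sketch in Rust (just string-matching the flags line, nothing more):<p><pre><code> // Check whether the CPU advertises ERMS / FSRM, as reported in /proc/cpuinfo.
fn main() -> std::io::Result<()> {
    let cpuinfo = std::fs::read_to_string("/proc/cpuinfo")?;
    // The "flags" line lists space-separated feature names for a core.
    let flags_line = cpuinfo.lines().find(|l| l.starts_with("flags")).unwrap_or("");
    for flag in ["erms", "fsrm"] {
        let present = flags_line.split_whitespace().any(|f| f == flag);
        println!("{flag}: {}", if present { "yes" } else { "no" });
    }
    Ok(())
}</code></pre>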
Having to hand-roll optimized routines for each new CPU generation has been an ongoing pain for decades.<p>And yet here we are again. Shouldn't this be covered by some timing test suite run by CPU vendors by now?
I was prepared to read the article and scoff at the author's misuse of std::fs. However, the article is a delightful succession of rabbit holes and mysteries. Well written and very interesting!
I'm a bit confused about the premise. This is not comparing pure Python code against some native (C or Rust) code. It's comparing one Python wrapper around native code (Python's file read method) against another Python wrapper around some native code (OpenDAL). OK it's still interesting that there's a difference in performance, but it's very odd to describe it as "slower than Python". Did they expect that the Python standard library is all written in pure Python? On the contrary, I would expect the implementations of functions in Python's standard library to be native and, individually, highly optimised.<p>I'm not surprised the conclusion had something to do with the way that native code works. Admittedly I was surprised at the specific answer - still a very interesting article despite the confusing start.<p>Edit: The conclusion also took me a couple of attempts to parse. There's a heading "C is slower than Python with specified offset". To me, as a native English speaker, this reads as "C is slower (than Python) with specified offset" i.e. it sounds like they took the C code, specified the same offset as Python, and then it's still slower than Python. But it's the opposite: once the offset from Python was also specified in the C code, the C code was then faster. Still very interesting once I got what they were saying though.
The article itself is a great read and it has fascinating info related to this issue.<p>However, I am more interested/concerned about another part: how the issue is reported/recorded and how the communications are handled.<p>Reporting is done over Discord, a proprietary environment that is not indexed or searchable, and will not be archived.<p>Communications and deliberations are done over Discord and Telegram, which is probably worse than Discord in this context.<p>This blog post and the GitHub repository are the lingering remains of them. If Xuanwo had not blogged this, it would be lost to the timeline.<p>Isn't this fascinating?
So the obvious thing to do... send a patch to change the "copy_user_generic" kernel function to use a different memory-copying implementation when the CPU is detected to be an affected one and the memory alignment is one that triggers the slowness bug...
jemalloc was Rust's default allocator till 2018.<p><a href="https://internals.rust-lang.org/t/jemalloc-was-just-removed-from-the-standard-library/8759" rel="nofollow noreferrer">https://internals.rust-lang.org/t/jemalloc-was-just-removed-...</a>
> Rust developers might consider switching to jemallocator for improved performance<p>I am curious if this is something that everyone can do to get free performance or if there are caveats. Can C codebases benefit from this too? Is this performance that is simply left on the table currently?
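<p>For Rust, the switch itself is only a couple of lines; a minimal sketch, assuming the jemallocator crate has been added as a dependency in Cargo.toml. C programs can often try the same swap without recompiling by LD_PRELOAD-ing libjemalloc (the library path varies by distro).<p><pre><code> // Route all heap allocations through jemalloc instead of the system allocator.
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Everything below now allocates via jemalloc.
    let v: Vec<u8> = vec![0u8; 1 << 20];
    println!("allocated {} bytes", v.len());
}</code></pre>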
AMD's string store is not like Intel's. Generally, you don't want to use it until you are past the CPU's L2 size (L3 is a victim cache), making ~2k WAY too small. Once past that point, it's profitable to use string store, and it should run at "DRAM speed". But it has a high startup cost, hence 256-bit vector loads/stores should be used until that threshold is met.
BTW, I've always thought Python uses way too many syscalls when working with files. Simple code like this uses something like 9 syscalls (shown in the article):<p><pre><code> with open('myfile') as f:
data = f.read()
</code></pre>
I'm not much of a C programmer myself, but I at least reported part of the issue to Python: <a href="https://bugs.python.org/issue45944" rel="nofollow noreferrer">https://bugs.python.org/issue45944</a><p>This is the fastest way to read a file in Python that I've found, using only 3-4 syscalls (though os.fstat() doesn't work for some special files like those in /proc/ and /dev/):<p><pre><code> def read_file(path: str, size=-1) -> bytes:
fd = os.open(path, os.O_RDONLY)
try:
if size == -1:
size = os.fstat(fd).st_size
return os.read(fd, size)
finally:
os.close(fd)</code></pre>
Delightful article. Thank you, author, for sharing! I felt like I experienced every shocking twist and surprise in your journey, as if I was right there with you all along.
A related thing from the times when it was common for memory layout artifacts to have a high impact on software performance: <a href="https://en.wikipedia.org/wiki/Cache_coloring" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Cache_coloring</a>
Disclaimer: The title has been changed to "Rust std fs slower than Python!? No, it's hardware!" to avoid clickbait. However, I'm not able to fix the title on HN.
> However, mmap has other uses too. It's commonly used to allocate large regions of memory for applications.<p>Slack is allocating 1132 GB of virtual memory on my laptop right now. I don't know if they are using mmap but that's 1100 GB more than the physical memory.
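<p>That huge virtual-vs-resident gap is exactly what anonymous mmap gives you: the kernel hands out address space, but no physical pages until they are touched. A small sketch (assuming the libc crate) that reserves 100 GiB while resident memory barely moves:<p><pre><code> fn main() {
    // Reserve 100 GiB of virtual address space; nothing is backed by RAM yet.
    let len: usize = 100 << 30;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
    // VSZ jumps by ~100 GiB, RSS stays small until pages are written.
    println!("reserved {} GiB at {:p}", len >> 30, ptr);
    unsafe { libc::munmap(ptr, len); }
}</code></pre>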
Anyone else feeling the frequency illusion with rep movsb?<p>(<a href="https://lock.cmpxchg8b.com/reptar.html" rel="nofollow noreferrer">https://lock.cmpxchg8b.com/reptar.html</a>)
Either the author changed the headline to something less clickbaity in the meantime, or you edited it for clickbait, Pop_- (in that case: shame on you) - current headline: "Rust std fs slower than Python!? No, it's hardware!"
Totally unrelated but: this post talks about the bug being first discovered in OpenDAL [1], which seems to be an Apache (Incubator) project to add an abstraction layer for storage over several types of storage backend. What's the point/use case of such an abstraction? Anybody using it?<p>[1] <a href="https://opendal.apache.org/" rel="nofollow noreferrer">https://opendal.apache.org/</a>
>Rust std fs slower than Python!? No, it's hardware!<p>>...<p>>Python features three memory domains, each representing different allocation strategies and optimized for various purposes.<p>>...<p>>Rust is slower than Python only on my machine.<p>If one library performs wildly better than the other in the same test, on the same hardware, how can that not be a software-related problem? It sounds like a contradiction.<p>Maybe it should be considered a coding issue and/or a missing feature? IMHO it would be expected that Rust's std library performs well without making all users circumvent the issue manually.<p>The article is well investigated, so I assume the author just wants to show that the problem exists without creating controversy, because otherwise I cannot understand it.