There are two dedicated CPU feature flags (ERMS and FSRM) that indicate REP STOS/MOV are fast and usable as a short instruction sequence for memset/memcpy.
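<p>On Linux you can see whether the CPU advertises them by looking for the "erms" and "fsrm" entries in /proc/cpuinfo; a minimal sketch in Rust (just string-matching the flags line, nothing more):<p><pre><code> // Check whether the CPU advertises ERMS / FSRM, as reported in /proc/cpuinfo.
fn main() -> std::io::Result<()> {
    let cpuinfo = std::fs::read_to_string("/proc/cpuinfo")?;
    // The "flags" line lists space-separated feature names for a core.
    let flags_line = cpuinfo.lines().find(|l| l.starts_with("flags")).unwrap_or("");
    for flag in ["erms", "fsrm"] {
        let present = flags_line.split_whitespace().any(|f| f == flag);
        println!("{flag}: {}", if present { "yes" } else { "no" });
    }
    Ok(())
}</code></pre>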
Having to hand-roll optimized routines for each new CPU generation has been an ongoing pain for decades.<p>And yet here we are again. Shouldn't this be covered by some timing test suite run by CPU vendors by now?
I was prepared to read the article and scoff at the author's misuse of std::fs. However, the article is a delightful succession of rabbit holes and mysteries. Well written and very interesting!
I'm a bit confused about the premise. This is not comparing pure Python code against some native (C or Rust) code. It's comparing one Python wrapper around native code (Python's file read method) against another Python wrapper around some native code (OpenDAL). OK it's still interesting that there's a difference in performance, but it's very odd to describe it as "slower than Python". Did they expect that the Python standard library is all written in pure Python? On the contrary, I would expect the implementations of functions in Python's standard library to be native and, individually, highly optimised.<p>I'm not surprised the conclusion had something to do with the way that native code works. Admittedly I was surprised at the specific answer - still a very interesting article despite the confusing start.<p>Edit: The conclusion also took me a couple of attempts to parse. There's a heading "C is slower than Python with specified offset". To me, as a native English speaker, this reads as "C is slower (than Python) with specified offset" i.e. it sounds like they took the C code, specified the same offset as Python, and then it's still slower than Python. But it's the opposite: once the offset from Python was also specified in the C code, the C code was then faster. Still very interesting once I got what they were saying though.
The article itself is a great read and it has fascinating info related to this issue.<p>However, I am more interested/concerned about another part: how the issue is reported/recorded and how the communications are handled.<p>Reporting is done over Discord, a proprietary environment that is not indexed or searchable, and will not be archived.<p>Communications and deliberations are done over Discord and Telegram, which is probably worse than Discord in this context.<p>This blog post and the GitHub repository are the lingering remains of them. If Xuanwo had not blogged this, it would be lost to the timeline.<p>Isn't this fascinating?
So the obvious thing to do... send a patch to change the "copy_user_generic" kernel function to use a different memory-copying implementation when the CPU is detected to be an affected one and the memory alignment is one that triggers the slowness bug...
jemalloc was Rust's default allocator till 2018.<p><a href="https://internals.rust-lang.org/t/jemalloc-was-just-removed-from-the-standard-library/8759" rel="nofollow noreferrer">https://internals.rust-lang.org/t/jemalloc-was-just-removed-...</a>
> Rust developers might consider switching to jemallocator for improved performance<p>I am curious if this is something that everyone can do to get free performance or if there are caveats. Can C codebases benefit from this too? Is this performance that is simply left on the table currently?
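<p>For Rust, the switch itself is only a couple of lines; a minimal sketch, assuming the jemallocator crate has been added as a dependency in Cargo.toml. C programs can often try the same swap without recompiling by LD_PRELOAD-ing libjemalloc (the library path varies by distro).<p><pre><code> // Route all heap allocations through jemalloc instead of the system allocator.
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Everything below now allocates via jemalloc.
    let v: Vec<u8> = vec![0u8; 1 << 20];
    println!("allocated {} bytes", v.len());
}</code></pre>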
AMD's string store is not like Intel's. Generally, you don't want to use it until you are past the CPU's L2 size (L3 is a victim cache), making ~2k WAY too small. Once past that point, it's profitable to use string store, and it should run at "DRAM speed". But it has a high startup cost, hence 256-bit vector loads/stores should be used until that threshold is met.
BTW, I've always thought Python uses way too many syscalls when working with files. Simple code like this uses something like 9 syscalls (shown in the article):<p><pre><code> with open('myfile') as f:
data = f.read()
</code></pre>
I'm not much of a C programmer myself, but I at least reported part of the issue to Python: <a href="https://bugs.python.org/issue45944" rel="nofollow noreferrer">https://bugs.python.org/issue45944</a><p>This is the fastest way to read a file in Python that I've found, using only 3-4 syscalls (though os.fstat() doesn't work for some special files like those in /proc/ and /dev/):<p><pre><code> def read_file(path: str, size=-1) -> bytes:
fd = os.open(path, os.O_RDONLY)
try:
if size == -1:
size = os.fstat(fd).st_size
return os.read(fd, size)
finally:
os.close(fd)</code></pre>
Delightful article. Thank you, author, for sharing! I felt like I experienced every shocking twist and surprise in your journey, as if I was right there with you all along.
A related thing from the times when it was common for memory layout artifacts to have a high impact on software performance: <a href="https://en.wikipedia.org/wiki/Cache_coloring" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Cache_coloring</a>
Disclaimer: The title has been changed to "Rust std fs slower than Python!? No, it's hardware!" to avoid clickbait. However, I'm not able to fix the title on HN.
> However, mmap has other uses too. It's commonly used to allocate large regions of memory for applications.<p>Slack is allocating 1132 GB of virtual memory on my laptop right now. I don't know if they are using mmap but that's 1100 GB more than the physical memory.
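<p>That huge virtual-vs-resident gap is exactly what anonymous mmap gives you: the kernel hands out address space, but no physical pages until they are touched. A small sketch (assuming the libc crate) that reserves 100 GiB while resident memory barely moves:<p><pre><code> fn main() {
    // Reserve 100 GiB of virtual address space; nothing is backed by RAM yet.
    let len: usize = 100 << 30;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
    // VSZ jumps by ~100 GiB, RSS stays small until pages are written.
    println!("reserved {} GiB at {:p}", len >> 30, ptr);
    unsafe { libc::munmap(ptr, len); }
}</code></pre>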
Anyone else feeling the frequency illusion with rep movsb?<p>(<a href="https://lock.cmpxchg8b.com/reptar.html" rel="nofollow noreferrer">https://lock.cmpxchg8b.com/reptar.html</a>)
Either the author changed the headline to something less clickbaity in the meantime, or you edited it for clickbait, Pop_- (in that case: shame on you) - current headline: "Rust std fs slower than Python!? No, it's hardware!"
Totally unrelated but: this post talks about the bug being first discovered in OpenDAL [1], which seems to be an Apache (Incubator) project to add an abstraction layer for storage over several types of storage backend. What's the point/use case of such an abstraction? Anybody using it?<p>[1] <a href="https://opendal.apache.org/" rel="nofollow noreferrer">https://opendal.apache.org/</a>
>Rust std fs slower than Python!? No, it's hardware!<p>>...<p>>Python features three memory domains, each representing different allocation strategies and optimized for various purposes.<p>>...<p>>Rust is slower than Python only on my machine.<p>If one library performs wildly better than the other in the same test, on the same hardware, how can that not be a software-related problem? It sounds like a contradiction.<p>Maybe it should be considered a coding issue and/or a missing feature? IMHO it would be expected that Rust's std library performs well without making all users circumvent the issue manually.<p>The article is well investigated, so I assume the author just wants to show that the problem exists without creating controversy, because otherwise I cannot understand it.