Way back in my 6502 days I entered a competition for the fastest sieve program. My program had self-modifying code running on the zero page. To reset the 8K sieve memory, the program used 24K of memory for the unrolled store instructions.<p>The winning entry left me in the dust: rather than zeroing the memory, the winning program’s array was initialized with a template of the first few primes. There can be solutions that are even faster than the fastest possible zeroing.
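A rough C++ sketch of that template trick (the original was 6502 assembly; the sizes and cutoffs here are my assumptions): resetting becomes a copy of a pre-sieved template, and only the primes not already baked into the template need to be crossed off.

```cpp
#include <cstring>
#include <vector>

constexpr int N = 8192;  // 8K sieve, as in the original competition

// Build a reusable template: a sieve with multiples of the first few
// primes already marked, so each "reset" is a copy rather than a
// zero-everything pass followed by re-sieving the small primes.
std::vector<unsigned char> make_template() {
    std::vector<unsigned char> t(N, 1);          // 1 = possibly prime
    t[0] = t[1] = 0;
    for (int p : {2, 3, 5, 7})
        for (int m = p * p; m < N; m += p) t[m] = 0;
    return t;
}

// Reset by copying the template, then sieve only the remaining primes.
void sieve(unsigned char* s, const unsigned char* tmpl) {
    std::memcpy(s, tmpl, N);                     // replaces the zeroing pass
    for (int p = 11; p * p < N; ++p)
        if (s[p])
            for (int m = p * p; m < N; m += p) s[m] = 0;
}
```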
I’ve always wondered how much CPU time and memory bandwidth is taken up by the OS zeroing out pages before handing them out, as well as programs and libraries clearing chunks of memory.<p>I guess it’s enough that I’m surprised there’s no hardware support for the memory system to handle it by itself on command, without taking up bus bandwidth or CPU cycles. Kind of like old-fashioned REP STOS but handled off-chip, as it were.<p>[Added:] Concerning various instructions for clearing whole cache lines in one go, you still end up with lots of dirty cache lines that have to be written back through L1, L2, ..., RAM (not to mention the stuff that was previously in those cache lines), so there’s still lots of bus bandwidth being consumed.
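For reference, invoking the REP STOS primitive mentioned above looks like this (a sketch using GCC/Clang inline assembly; the fallback path is mine):

```cpp
#include <cstddef>
#include <cstring>

// x86's microcoded bulk store: fill `count` bytes at `dst` with `val`.
// Modern CPUs special-case this pattern ("enhanced REP MOVSB/STOSB"),
// but the stores still flow through the cache hierarchy, which is the
// point being made above. Falls back to memset off x86-64.
static void bulk_store(void* dst, unsigned char val, std::size_t count) {
#if defined(__x86_64__)
    // rep stosb: store AL to [RDI], RCX times, incrementing RDI.
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(count)
                     : "a"(val)
                     : "memory");
#else
    std::memset(dst, val, count);
#endif
}
```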
If you're routinely zeroing this much memory and the performance matters, you might benefit from idle-zeroing it. That is, when you need to zero the massive block, just switch to a different block that has already been zeroed or partly-zeroed in the background. Whatever hasn't already been zeroed, finish synchronously. The background thread doing the zeroing would be scheduled with the lowest priority, so that it only runs when the system otherwise has nothing to do.<p>At first, I thought you might just want to get fresh pages from the kernel (which are always zeroed), but this answer convinced me that might not actually be faster because of the overhead from syscalls and fiddling with virtual memory <a href="https://stackoverflow.com/questions/49896578/fastest-way-to-zero-pages-in-linux" rel="nofollow">https://stackoverflow.com/questions/49896578/fastest-way-to-...</a> . And Linux doesn't idle-zero or pre-zero pages (though I believe there's a switch to enable pre-zeroing for purposes of security hardening), so you're probably gonna end up with the OS doing a synchronous[1] zeroing anyway.<p>[1] Synchronous from when you actually first write to each page. My understanding is that when your process gets new pages, they're all mapped to a special zero-page and set to copy-on-write. So there is still some efficiency here in theory: you don't have a long wait for the entire range to be zeroed all at once and you never have to zero pages that you don't modify.
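A minimal sketch of that idle-zeroing scheme (the class and names are mine, not from the thread). A real version would lower the zeroing thread's priority with an OS-specific call such as nice() or SetThreadPriority(), and block on a condition variable instead of polling:

```cpp
#include <cstring>
#include <deque>
#include <mutex>
#include <vector>

// Blocks cycle: acquire() -> use -> release() -> background zeroing -> ready.
class ZeroPool {
public:
    explicit ZeroPool(std::size_t block_size) : size_(block_size) {}

    // Get a zeroed block; fall back to synchronous zeroing if none is ready.
    std::vector<char> acquire() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (!ready_.empty()) {
                auto b = std::move(ready_.front());
                ready_.pop_front();
                return b;
            }
        }
        return std::vector<char>(size_, 0);   // synchronous fallback
    }

    // Hand a used block back; the idle thread will zero it later.
    void release(std::vector<char> b) {
        std::lock_guard<std::mutex> lk(mu_);
        dirty_.push_back(std::move(b));
    }

    // Body of the low-priority background thread: zero one dirty block
    // if there is one. Returns false when there is nothing to do.
    bool zero_one() {
        std::vector<char> b;
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (dirty_.empty()) return false;
            b = std::move(dirty_.front());
            dirty_.pop_front();
        }
        std::memset(b.data(), 0, b.size());   // done outside the lock
        std::lock_guard<std::mutex> lk(mu_);
        ready_.push_back(std::move(b));
        return true;
    }

private:
    std::size_t size_;
    std::mutex mu_;
    std::deque<std::vector<char>> ready_, dirty_;
};
```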
This goes to show that sometimes it really pays to read the docs: the doc comment of "fill" says "For char types filling contiguous areas of memory, this becomes an inline call to @c memset or @c wmemset"...
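A quick way to see what that doc comment promises: for a char-typed range, both of these functions should end up as a memset call (easy to confirm on Compiler Explorer):

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Per the doc comment quoted above, libstdc++ lowers std::fill over
// contiguous char ranges to an inline memset, so these two should
// compile to essentially the same code for large buffers.
void clear_with_fill(std::vector<char>& v) {
    std::fill(v.begin(), v.end(), 0);
}

void clear_with_memset(std::vector<char>& v) {
    std::memset(v.data(), 0, v.size());
}
```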
Interestingly, the std::array::fill member function generates identical code for int and char; I suppose that's because there's only one overload of fill and it has to take the element type. No idea if the generated stosq is as fast as built-in memset: <a href="https://godbolt.org/z/4iYGup" rel="nofollow">https://godbolt.org/z/4iYGup</a>
Daniel Lemire compares std::fill in C++ with memset in C, and his findings agree with Travis Downs's: <a href="https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-quickly-in-c/" rel="nofollow">https://lemire.me/blog/2020/01/20/filling-large-arrays-with-...</a>
another day, another reason I am glad I only know <i>some</i> C++ idioms. If I had a byte-oriented block of data, it would never occur to me to use std::fill() ... because it would never occur to me to use std::fill() for anything at all ! :)