Huge pages are a good idea

175 points by moreati over 2 years ago

19 comments

aseipp over 2 years ago
Huge pages are an absolutely great idea. I will continue to complain about our eternal commitment to 4k pages, with all their pitfalls, which it appears we may be stuck with until the heat death of the universe. At the very minimum we could go to 16k pages, which are good for many reasons including, in particular, being able to have bigger VIPT cache sizes without needing an increase in associativity (and thus latency). Not the end of the world, but a very solid win on top of the TLB wins.

But transparent hugepages continue to be a massive source of bugs, weird behaviors, and total system failures in my experience. I just got a bug report this week where a simple THP-enabled system spun out of control, with a kernel task locking the system at 100% CPU for minutes, triggered by a 10-line reproducer via mmap(2). This was in combination with qemu/libvirt in a virtual machine, and it's possible the virtualization stack is just exposing bugs, but, like, this is very well-tested stuff! I'm not sure "Google enabled it fleet-wide, so it can be done" is very reassuring to me when most of us don't have fleetops/infra/kernel teams capable of dealing with this stuff. The person who reported this bug said they started seeing odd behavior a month ago, before boiling it down; it wasn't readily apparent at all. Is this just a massive footgun for our distro users? I dunno. Something that works in the p95 cases and then collapses horrifically in the p99 cases like this doesn't feel great. I try not to be superstitious about things like this, but, yeah. It's weird.

Anyway. This reminds me I have to submit some patches to disable jemalloc in a few aarch64 packages so I can use them on Asahi Linux. 4k pages will haunt us until the end of time.
mastax over 2 years ago
I recently set up huge pages on my database server (MariaDB and Postgres) the recommended way, which required way too much rigamarole IMO. Add to the kernel command line to statically allocate a number of huge pages of a certain size. Create a group to access huge pages. Configure that group to access huge pages. Add mysqld etc. to the group. Configure the huge pages to be mounted as a virtual filesystem in /dev/ for some reason. Add corresponding configuration to the database to tell it to use huge pages and where to get them.

This should all just be a single boolean flag in the database config telling it to use huge pages, which it gets from mmap dynamically. Why is any of the filesystem, permission, and static-allocation malarkey necessary?
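For what it's worth, the dynamic path does exist for anonymous mappings, skipping the hugetlbfs mount entirely. A minimal C sketch, assuming huge pages have already been reserved (e.g. via vm.nr_hugepages; exact permission requirements vary by kernel and distro):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 2 * 1024 * 1024;   /* one 2 MiB huge page */

        /* MAP_HUGETLB asks for an explicit huge page from the kernel's
           reserved pool; no hugetlbfs mount is needed for an anonymous
           mapping. Fails with ENOMEM if nothing was reserved. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        memset(p, 0, len);
        munmap(p, len);
        return 0;
    }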
phkahler over 2 years ago
I remember when hard drives moved to 4k sectors. It seemed insane that they were still 512 bytes, and it seemed neat to have them the same size as memory pages. But 4k pages seem incredibly small today, and I would argue that any application suffering with larger pages is doing something wrong. To have performance problems you need to use large amounts of memory with small allocations, right? Not only that, but also free some to cause fragmentation?

It's strange to me that this is an issue so late in the game.
smcameron over 2 years ago
After seeing this, I spent a half hour or so this morning and I was able to implement this in one of my programs, and now it runs to completion in about 85 percent of the time that it used to require. So, thanks!
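A minimal sketch of what such a change often looks like — assuming, hypothetically, a single large hot allocation, which may not match what was actually done here — is to align the allocation to the 2 MiB boundary and hint the kernel with madvise:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define HUGE_PAGE (2UL * 1024 * 1024)

    /* Round the request up to a whole number of 2 MiB pages, align the
       allocation to a huge-page boundary, and ask the kernel to back
       the range with transparent huge pages. The hint is advisory, so
       a failed madvise is safe to ignore. */
    static void *huge_alloc(size_t size) {
        size_t rounded = (size + HUGE_PAGE - 1) & ~(HUGE_PAGE - 1);
        void *p = aligned_alloc(HUGE_PAGE, rounded);
        if (p)
            madvise(p, rounded, MADV_HUGEPAGE);
        return p;
    }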
bagels over 2 years ago
> The Linux kernel's implementation of transparent huge pages has been the source of performance problems.

I remember (admittedly years ago) spending a lot of time trying to debug server-crippling performance problems, ultimately learning that transparent huge pages were the cause.

Proceed with caution.
fweimer over 2 years ago
It's funny how this moves in circles: automatic merging into hugepages (THP) is added to the kernel. Some workloads suffer, so it gets turned off again in some distributions (not all, of course). But for many different workloads (some say the majority), THP is actually really, really beneficial, so it is ported to more and more mallocs over time.

It might have been more straightforward to teach the problematic workloads to opt out of THP once the problems were discovered.
Genbox over 2 years ago
I've built an efficient in-memory database. It has to be very low latency on batched queries. When I couldn't squeeze more perf out of it myself, I tried enabling large page support (dotnet on Windows) and instantly got a 10% perf increase.

It was not at all a pleasant experience due to lack of documentation as well as the flaky implementation. But I was surprised by how much overhead the TLB accounted for.
zokier over 2 years ago
Sounds like a lot of the issues stem from the transparent aspect of huge pages: (un-)mappings are not all rounded to the huge page size, and 4k pages are still supported. Has there been any consideration of non-transparent huge pages, where all mappings are rounded and all you get are huge pages?
earhart over 2 years ago
NB: this is even more important in VM scenarios, where second-level address translation means something needs to walk a guest-physical-to-system-physical map for *each* level of the guest-virtual-to-guest-physical map. So TLB locality becomes even more important, and using huge pages cuts down on a multiplier in resolving TLB misses.
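A rough illustration of that multiplier: with 4-level tables on both sides, each of the four guest page-table loads needs its own 4-step walk through the host tables, so a worst-case two-dimensional walk costs 4×4 + 4 + 4 = 24 memory accesses, versus 4 for a native miss.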
gnufx over 2 years ago
People seem to be assuming that only malloc'ed memory is relevant. At least for Fortran, you want to allocate arrays on the stack (gfortran -fstack-arrays).

Apart from using larger/huge pages, you may want to take steps to minimize TLB misses. The Goto BLAS paper talks about that.
fulafel over 2 years ago
It's mind-warping terminology to call them huge pages. Memory sizes have increased 2000-4000x since x86 Linux picked 4 kB, so 2 MB pages are still, in relative terms, smaller than 4 kB pages were back then, and it should be a no-brainer to use them by default.
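(For scale: 2 MB / 4 kB = 512x, well below that 2000-4000x growth in memory size.)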
signa11 over 2 years ago
dupe: https://news.ycombinator.com/item?id=34450032
zajio1am over 2 years ago
The problem with transparent huge pages is that they are pushed onto applications that were developed with the expectation that memory management is based on pages of a fixed size. Those applications now run on a system that reports 4k pages but sometimes substitutes 2M pages, in a not-really-transparent manner, causing unexpected memory consumption issues. If the default for THP were madvise-only, so that THP-aware applications could opt in, it would not cause such problems.
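That mode does exist, for what it's worth: writing madvise to /sys/kernel/mm/transparent_hugepage/enabled restricts THP to ranges that opt in via madvise(MADV_HUGEPAGE). The complaint stands because some distributions ship with it set to always.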
jeffbee over 2 years ago
The easiest way to exploit THP, by far, is to link your program against TCMalloc and forget about it. Literally free money. Highly recommended.

https://github.com/google/tcmalloc
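(Note the linked repo is the Bazel-built TCMalloc that Google uses internally; the older gperftools flavor can even be dropped in without relinking, via LD_PRELOAD of libtcmalloc — the library path varies by distro.)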
zelphirkalt over 2 years ago
This is usually below the level of abstraction I am working on, so I have questions. Before madvise, did people simply assume that a memory page is always 4 KiB in size and build that assumption into so many programs? Is that why many programs break? Why did they assume that? Was there at least something like "size of int" around before madvise? And if so, why did they not use that?
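On the "size of int" question: POSIX has exposed the page size at runtime since long before madvise existed, so code never had to hard-wire 4096. A minimal sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Query the base page size at runtime instead of assuming 4 KiB;
           getpagesize() is the older BSD spelling of the same query. */
        long page = sysconf(_SC_PAGESIZE);
        printf("base page size: %ld bytes\n", page);
        return 0;
    }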
yagizdegirmenci over 2 years ago
There is also a great blog post that covers reliably allocating huge pages: https://mazzo.li/posts/check-huge-page.html
devchix over 2 years ago
** EXCEPT FOR SPLUNK ** (or applications which perform I/O on small bits of data)

https://docs.splunk.com/Documentation/Splunk/9.0.3/ReleaseNotes/SplunkandTHP

I have notes-to-myself which I can't find at the moment, but the TL;DR is: know what you're doing. If you disable it, make a note, so that if the server is re-purposed you don't hamstring an unwitting inheriting admin. I was up against the wall on a Linux indexer which was, for mysterious reasons, under-performing by a ridiculous percentage, with all kinds of crazy latency, until I disabled THP. This was years ago. If you disable it, make it a systemd service so it can be discovered; people don't check /sys/kernel/ as a frequent place to look.

Edited: THP, not TLB (the translation look-aside buffer).
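A sketch of that discoverable-disable idea as a unit file — the unit name, description, and target choices here are illustrative, not canonical:

    # /etc/systemd/system/disable-thp.service  (hypothetical name)
    [Unit]
    Description=Disable transparent huge pages (see runbook before re-enabling)
    After=local-fs.target

    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'

    [Install]
    WantedBy=multi-user.target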
zajio1am over 2 years ago
If your application breaks due to THP, do not despair! There is prctl(PR_SET_THP_DISABLE), allowing it to be disabled on a per-process basis.
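A minimal sketch of the call (available since Linux 3.15):

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void) {
        /* Opt this process (and its future children) out of THP,
           regardless of the system-wide /sys setting. */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
            perror("prctl(PR_SET_THP_DISABLE)");
        return 0;
    }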
compressedgas over 2 years ago
The original title was: Huge Pages are a Good Idea