Please be aware that the article describes a problem with a specific implementation of THP. Other operating systems implement it differently and don't suffer from the same caveats (though any implementation will of course have its own disadvantages, since THP support requires making various tradeoffs and policy decisions). FreeBSD's implementation (based on [1]) is more conservative and works by opportunistically reserving physically contiguous ranges of memory in a way that allows THP promotion if the application (or kernel) actually makes use of all the pages backed by the large mapping. It's tied into the page allocator in a way that avoids the "leaks" described in the article, and doesn't make use of expensive scans. Moreover, the reservation system enables other optimizations in the memory management subsystem.

[1] https://www.cs.rice.edu/~druschel/publications/superpages.pdf
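For anyone who wants to poke at this on a FreeBSD box, the mechanism is exposed through a handful of sysctls. A minimal sketch, assuming amd64 (the exact OIDs can differ between platforms and releases):

    # is superpage promotion enabled? (on by default)
    sysctl vm.pmap.pg_ps_enabled

    # counters showing how often reservations actually get promoted or demoted
    sysctl vm.pmap.pde.promotions vm.pmap.pde.demotions vm.pmap.pde.mappings

Watching the promotions counter while a workload runs is a quick way to see whether it really touches enough of each reservation to earn a superpage.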
I've had a really bad run-in with transparent hugepage defragmentation. In a workload consisting of many small-ish reductions, my programme spent over 80% of its total running time in pageblock_pfn_to_page (this was on a 4.4 kernel, https://github.com/torvalds/linux/blob/v4.4/mm/compaction.c#L74-L115) and 97% of the total time in hugepage compaction kernel code. Disabling hugepage defrag with "echo never > /sys/kernel/mm/transparent_hugepage/defrag" led to an instant 30x performance improvement.

There's been some work to improve performance (e.g. https://github.com/torvalds/linux/commit/7cf91a98e607c2f935dbcc177d70011e95b8faff in 4.6), but I haven't checked whether it fixes my workload.
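For anyone hitting the same thing, a minimal sketch of inspecting and changing these knobs at runtime (the paths are standard on Linux; the set of accepted values varies a bit between kernel versions):

    # show the current THP and defrag policies (the active value is in brackets)
    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag

    # stop the expensive compaction/defrag work but keep THP itself
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    # or turn THP off entirely
    echo never > /sys/kernel/mm/transparent_hugepage/enabled

Note that these settings don't survive a reboot; they have to be reapplied from an init script or set via transparent_hugepage=never on the kernel command line.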
So glad this is on the front page of HN. A good 30% of perf problems for our clients are low-level misconfigurations like this one.
For databases:
huge pages - good
THP - bad
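For what it's worth, a rough sketch of that setup on Linux: disable THP (as in the sysfs commands elsewhere in this thread) and give the database explicit, pre-reserved huge pages instead. The page count below is only an illustration and has to be sized to the database's shared memory, and the database itself must be configured to request huge pages:

    # reserve 4096 x 2MB explicit huge pages (~8GB)
    sysctl -w vm.nr_hugepages=4096

    # make the reservation persistent across reboots
    echo 'vm.nr_hugepages = 4096' > /etc/sysctl.d/hugepages.conf

    # confirm the reservation (HugePages_Total / HugePages_Free)
    grep -i huge /proc/meminfo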
Not to mention that there was a race condition in the implementation which would cause random memory corruption under high memory load. Varnish Cache would consistently hit this. Recently fixed:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/7.2_release_notes/index#kernel
Agreed. Found this to be a problem and fixed it by switching it off three years ago. It seems to be a bigger problem on large systems than on small ones. We had a 64-core server with 384GB RAM, and running too many JVMs made khugepaged go into overdrive and basically cripple the server entirely - unresponsive, getting 1% of the work done, etc.
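If switching THP off entirely isn't an option, khugepaged's aggressiveness can at least be observed and dialed down through its sysfs tunables. A sketch, with the caveat that the exact set of files depends on the kernel version:

    # is khugepaged the thing burning CPU right now?
    top -b -n 1 | grep khugepaged

    # how hard it scans: pages per pass and sleep between passes
    cat /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
    cat /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

    # back it off, e.g. scan only once a minute
    echo 60000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs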
I stumbled upon this feature when some Windows VMs running 3D-accelerated programs exhibited freezes of multiple seconds every now and then. We quickly discovered that khugepaged would hog the CPU completely during these hangs. Disabling THP solved the performance issues.
Bad advice... The following article is much better at actually measuring the impact:

https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/

The conclusion in particular is noteworthy:

> Do not blindly follow any recommendation on the Internet, please! Measure, measure and measure again!
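In that spirit, a few quick ways to check whether THP is actually in play on a given box and whether compaction is where the time goes. A sketch, not an exhaustive methodology:

    # anonymous memory currently backed by huge pages, system-wide
    grep AnonHugePages /proc/meminfo

    # THP fault/collapse and compaction counters (sample over time)
    grep -E 'thp|compact' /proc/vmstat

    # if compaction is suspected, see where kernel CPU time is going
    perf top -g

If the compact_stall / thp_* counters climb quickly while throughput drops, that's a much stronger signal than any blanket recommendation.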
Transparent hugepages cause a massive slowdown on one of my systems. It has 64GB of RAM, but it seems the kernel allocator fragments under my workload after a couple of days, resulting in very few free >2MB regions (as per /proc/buddyinfo) even with >30GB of free RAM. This slowed my KVM boots down dramatically (10s -> minutes), and perf top looked like the allocator was spending a lot of cycles repeatedly trying and failing to allocate huge pages.

(I don't want to preallocate hugepages because KVM is only a small part of my workload.)
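For anyone who wants to check for the same kind of fragmentation on their own box: /proc/buddyinfo lists, per memory zone, the number of free blocks of each order, and a 2MB huge page on x86-64 corresponds to order 9. A quick sketch:

    # columns are free blocks of order 0..10 (4KB, 8KB, ..., 4MB)
    cat /proc/buddyinfo

    # fragmentation looks like large counts in the low orders but
    # near-zero in the order-9+ columns despite plenty of free memory
    grep Normal /proc/buddyinfo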
Shouldn't huge pages be used automatically if you malloc() large amounts of memory at once? Wouldn't that cover some of the applications that benefit from it?