The “too small to fail” memory-allocation rule

150 points by kakakiki, over 10 years ago

10 comments

willvarfar, over 10 years ago

Somewhat related, the classic "Respite from the OOM killer" by Andries Brouwer:

> An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

http://lwn.net/Articles/104179/

erlkonig, over 10 years ago

Enabling overcommit machine-wide is a puerile, broken approach that not only converts your server to an unreliable toy, but encourages other idiots to rely on the same broken behavior in their libraries, language implementations, and so forth, basically leading to the current plethora of collection libraries that don't even bother to monitor their own memory use or check malloc's return. It is a software engineering plague, a rot on the underbelly of allegedly-solid code. oomkiller's unpredictability causes any number of problems in actual production environments, usually by killing the wrong process, and secretly ripping the stability out of programs whose code does check malloc's return.

The answer is:

{ echo 'vm.overcommit_memory = 2' ; echo 'vm.overcommit_ratio = 100' ; } >/etc/sysctl.d/10-no-overcommit.conf

Which restores classical semantics and allows processes to identify memory allocation failures and respond to them responsibly in a number of ways (garbage collect being an obvious one, clean, safe exits after logging being another).

Now, if we could say that a specific process was allowed to overcommit because we could guarantee it would use the bogus memory allocation, then we'd have something vaguely useful.
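For readers who want to see what "check malloc's return and respond responsibly" might look like, here is a minimal sketch in C. It is not from the comment above: `fill_cache`, `release_cache` and `alloc_or_reclaim` are hypothetical names, and the single-block "cache" only stands in for whatever memory a real program could give back under pressure.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical reclaimable memory: stands in for caches a real program could drop. */
static void *cache = NULL;

/* Populate the cache; returns 1 on success, 0 if the allocation failed. */
int fill_cache(size_t size) {
    free(cache);
    cache = malloc(size);
    return cache != NULL;
}

/* Give the cached memory back; returns 1 if anything was released, 0 otherwise. */
int release_cache(void) {
    if (cache == NULL) {
        return 0;
    }
    free(cache);
    cache = NULL;
    return 1;
}

/*
 * Allocate `size` bytes. If malloc fails, release the cache and retry once.
 * If the retry also fails (or there was nothing to release), log and return
 * NULL so the caller can shut down cleanly instead of crashing later.
 */
void *alloc_or_reclaim(size_t size) {
    void *p = malloc(size);
    if (p == NULL && release_cache()) {
        p = malloc(size);
    }
    if (p == NULL) {
        fprintf(stderr, "allocation of %zu bytes failed\n", size);
    }
    return p;
}
```

With overcommit enabled, the failure branch here is rarely reached for small requests, which is the complaint: the check is only meaningful when malloc is actually allowed to return NULL.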

fit2rule, over 10 years ago

It's not exactly true that these error-recovery paths are untested - in the context of the broader collective it can be said that there is no certainty.

But the Linux kernel has been used in countless industries requiring precisely that level of testing. I myself have been involved in SIL-4 certification of embedded Linux kernels for the transportation industry, and we ran into this memory-alloc issue years ago; it's been quite widely understood already, and accommodated by the extremely rigorous testing that's required to get the Linux kernel in use in places where human lives are on the line.

So what I would suggest anyone working on this issue do is contact the folks who are using the Linux kernel in the SIL-4 context, and try to get support on releasing the tests that have been developed to exercise exactly this issue. It's not a new issue - all safety kernels have to be tested and certified (and have 100% code coverage completion) on the subject of out-of-memory conditions, and if this is not done there is no way that Linux can be used. Fact is, in 38+ countries around the world, the Linux kernel is keeping the trains on the rails already - the work has been done. It's maybe just not open/obvious to the LWN collective, as is often the case.

jakub_g, over 10 years ago

> But it is worse than that: since small allocations do not fail, almost none of the thousands of error-recovery paths in the kernel now are ever exercised.

I started noticing a similar thing with Firefox a year or two ago. Probably no one is heavily testing the browser's behavior in low-memory situations.

Basically, in low-memory conditions, things go crazy. Apart from low responsiveness, there is stuff happening like very strange rendering artifacts and occasional browser cache corruption.

The manifestation of the last one was pretty funny once for me: I started a chess-like game (figures were rendered as PNG images) and the computer had multiple kings and rooks ;) Took me a while to figure out the issue was on my browser's side.

IgorPartola, over 10 years ago

I know! In this case the OOM killer should kill the process that requested the XFS operation in the first place! To avoid deadlocks it should just KILL it, not TERM it. I don't see any problems with that solution :).

In all seriousness, wow. This is the type of thing that really must hurt. It'll be interesting to see which path they choose.

jkot, over 10 years ago

Interesting. I had a similar problem with recursive memory allocation while working on a database engine. The solution was relatively simple: reorder method calls inside the allocator, so that memory is allocated BEFORE cleanup progresses.

I think the Linux memory allocator devs could keep a small preallocated buffer, return allocated space from it, and schedule independent maintenance after the buffer gets low.
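A rough user-space sketch of the preallocated-buffer idea, in C. This is only an illustration under assumptions, not how the kernel does it: the names (`reserve_refill`, `reserve_take`, `reserve_is_low`), the slot count and the slot size are all made up for the example.

```c
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_SLOTS 4    /* assumed pool depth, chosen arbitrarily */
#define SLOT_SIZE     256  /* assumed fixed slot size, chosen arbitrarily */

static void *reserve[RESERVE_SLOTS];
static int reserve_count = 0;

/*
 * The "independent maintenance" step: top the pool back up while memory is
 * available. Runs outside the path that must not block or recurse.
 * Returns the number of slots currently held.
 */
int reserve_refill(void) {
    while (reserve_count < RESERVE_SLOTS) {
        void *p = malloc(SLOT_SIZE);
        if (p == NULL) {
            break;  /* keep what we have; try again on a later pass */
        }
        reserve[reserve_count++] = p;
    }
    return reserve_count;
}

/*
 * Hand out a preallocated slot without calling the allocator, so this path
 * cannot trigger reclaim. Returns NULL if the request is larger than a slot
 * or the pool is empty. The caller owns the slot and releases it with free().
 */
void *reserve_take(size_t size) {
    if (size > SLOT_SIZE || reserve_count == 0) {
        return NULL;
    }
    return reserve[--reserve_count];
}

/* True once the pool has dropped below half, i.e. a refill should be scheduled. */
int reserve_is_low(void) {
    return reserve_count < RESERVE_SLOTS / 2;
}
```

The point of the split is ordering: the allocation in the critical path is served from memory obtained earlier, and the cleanup or refill work happens separately once `reserve_is_low()` reports true.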

angersock, over 10 years ago

What is the BSD answer to the OOM killer? Doesn't have one, right?

zqfm, over 10 years ago

My first thought is the kernel should pre-allocate some space for running a recovery/cleanup/analysis process when malloc fails. Is anything like this done already? Can it defer to the user to decide what to do when that happens?

iopq, over 10 years ago

This thread is hilarious. That Ido guy keeps posting his do /once/ while (false); loops and ignoring everyone who tells him that's a horrible replacement for the goto.
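For anyone who has not seen the two idioms side by side, here is a small C sketch of the same allocation-failure cleanup written both ways. The function names are hypothetical and the code is not taken from the linked thread; it only illustrates the general shape being argued about.

```c
#include <stdlib.h>

/* Cleanup with goto: each failure jumps to the label that undoes prior steps. */
int make_buffers_goto(size_t n) {
    int rc = -1;
    char *a = NULL;
    char *b = NULL;

    a = malloc(n);
    if (a == NULL) {
        goto out;
    }
    b = malloc(n);
    if (b == NULL) {
        goto out_free_a;
    }

    rc = 0;  /* both buffers obtained; real work would go here */
    free(b);
out_free_a:
    free(a);
out:
    return rc;
}

/* The same logic with a do { ... } while (0) block and break. */
int make_buffers_dowhile(size_t n) {
    int rc = -1;
    char *a = NULL;
    char *b = NULL;

    do {
        a = malloc(n);
        if (a == NULL) {
            break;
        }
        b = malloc(n);
        if (b == NULL) {
            break;
        }
        rc = 0;  /* both buffers obtained; real work would go here */
    } while (0);

    /* A single exit point: free(NULL) is a no-op, so both calls are safe. */
    free(b);
    free(a);
    return rc;
}
```

The goto form can unwind in stages through distinct labels, while the loop form has only one place to break to, so every cleanup step after it has to tolerate resources that were never acquired.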

raldi, over 10 years ago

Couldn't the filesystem code release its locks before calling the OOM killer, then reacquire them?