Sounds like amazing work is happening here.<p>For real-time user applications such as games, newsfeeds, and mobile apps with visual/audio feedback, I'm certain that the most important feature is letting the application itself control when collection happens and what time constraints it must operate within. A collector can be <i>less</i> efficient yet be perceived as more responsive, as long as the wasteful or less efficient work is done at times when it does not matter - and that tradeoff would be gladly welcomed.<p>So much of UI application time is idle. Even during fluid animations, the majority of the frame time goes wasted. But then something like Objective-C's ARC frees a large tree of memory, or Java's GC collects, and your frame time exceeds 16ms. For UI applications, there are periods of time (during animations) where you must hit 16ms deadlines, and the difference between 1ms and 15ms is nothing, but the difference between 16ms and 17ms is <i>everything</i>. UI animations are like train schedules. If you miss the train by only a microsecond, you still have to wait the entire time until the next train arrives. You don't get points for <i>almost</i> making the train. Furthermore, <i>only</i> the application can know the train schedule. It isn't something the language runtime can possibly know, so to do this right, the language must anticipate this and accept input from the application frameworks.<p>Then there are other times, when we are not performing an animation, when we know we could block a thread for as much as 50ms without the user perceiving any delay. The latency constraints for <i>starting</i> a continuous interaction are looser than the constraints for <i>continuing</i> one. Our application still knows the train schedule; it's just a different schedule with more time to kill. If applications could tell the GC about this, it might decide that it's a good time to perform a major collection.<p>I've found that many of the things people consider performance problems in UI applications aren't problems of efficiency; they're problems of scheduling.
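To make the idea concrete, here is a minimal sketch in OCaml (since that's the runtime under discussion) of what an application-driven "train schedule" hook could look like. The <i>drive_gc</i> helper, its budget parameters, and the specific numbers are hypothetical and just mirror the scenarios above; <i>Gc.major_slice</i> and <i>Gc.full_major</i> are real functions in OCaml's Gc module, and <i>Unix.gettimeofday</i> comes from the unix library.<p><pre><code> (* Spend roughly [budget_ms] of known-idle time on GC work. *)
 let drive_gc ~budget_ms ~animating =
   let start = Unix.gettimeofday () in
   if not animating && budget_ms >= 50.0 then
     (* Between interactions there is plenty of slack:
        a full major collection fits in the idle window. *)
     Gc.full_major ()
   else
     (* Mid-animation: nibble at the major heap in small slices
        and stop well before the 16ms frame deadline. *)
     let rec loop () =
       let elapsed_ms = (Unix.gettimeofday () -. start) *. 1000.0 in
       if elapsed_ms < budget_ms then begin
         (* a minor collection plus enough major work to free ~1000 words *)
         ignore (Gc.major_slice 1000);
         loop ()
       end
     in
     loop ()
</code></pre>
The UI loop would call this with whatever slack it has measured for the current frame, e.g. drive_gc ~budget_ms:3.0 ~animating:true right after submitting a frame early, or with the ~50ms budget when no animation is running.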
> False promotions<p>> ... When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.<p>I deal with that issue somewhat here:<p><a href="http://www.kylheku.com/cgit/txr/tree/gc.c" rel="nofollow">http://www.kylheku.com/cgit/txr/tree/gc.c</a><p>When make_obj exhausts the FRESHOBJ_VEC_SIZE entries available in the freshobj array (the nursery), it first tries to shake things out by doing a partial GC. During its sweep phase, the partial GC rebuilds the freshobj array with just those gen-0 objects that are still reachable: the sweep marches through the freshobj array while a second index tracks where live objects are copied back into the array. At the end, that second index becomes the new allocation pointer for the array.<p>Only if the array is still full after this do we have to do a full gc, and that's when any "false promotion" will happen.<p>One problem with this approach is that this freshobj nursery becomes a miniature memory arena in itself; as it gets close to full, these partial garbage collections get more frequent. They are not <i>that</i> cheap.<p>Hmm, it occurs to me I can address that with a one-liner:<p><pre><code> diff --git a/gc.c b/gc.c
index 6832905..7a0ee1c 100644
--- a/gc.c
+++ b/gc.c
@@ -173,7 +173,7 @@ val make_obj(void)
          malloc_delta >= opt_gc_delta)
      {
        gc();
 -      if (freshobj_idx >= FRESHOBJ_VEC_SIZE)
 +      if (freshobj_idx >= FRESHOBJ_VEC_SIZE / 2)
          full_gc = 1;
        prev_malloc_bytes = malloc_bytes;
      }
</code></pre>
I.e., after doing an ephemeral GC, if at least half of the nursery space isn't free, set the flag for a full gc. (That will happen right away if the nursery still has no room to satisfy the current request, or else the next time gc is triggered.)<p>This has been a useful HN submission since it got me thinking about something, leading to some findings in a different project. :)
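In case it helps to see the nursery-rebuild idea in a few lines rather than digging through gc.c, here is a loose re-sketch of that two-index sweep, written in OCaml to match the rest of the thread. The real code is C; <i>sweep_nursery</i> and <i>is_reachable</i> are made-up names standing in for the actual mark information the collector consults.<p><pre><code> (* Walk the nursery up to the current allocation index, copy the
    still-reachable gen-0 objects back over the front of the same
    array, and return the copy index as the new allocation pointer. *)
 let sweep_nursery (nursery : 'a array) (alloc_idx : int)
                   ~(is_reachable : 'a -> bool) : int =
   let live = ref 0 in
   for i = 0 to alloc_idx - 1 do
     if is_reachable nursery.(i) then begin
       nursery.(!live) <- nursery.(i);  (* compact survivors to the front *)
       incr live
     end
   done;
   !live  (* slots at index >= this are free for new allocations *)
</code></pre>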
> typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.<p>Where does that advice come from? From what I've seen, typical suggested values for the young generation in the JVM and CLR are 10-20 MB. A young generation measured in gigabytes would defeat the point of dividing the heap into generations in the first place.<p>Also, unless you have profiled the GC and know exactly which steps of the process most of the time is spent in, you are fumbling in the dark.
It would be extremely useful if GC solutions were built in a more language-agnostic way. This would allow language designers to pick a GC that suits their needs, instead of reinventing the wheel every time.
Since you brought up Java, here is some information I could find about real-time garbage collection:<p>* JamaicaVM's RTGC seems to be the closest to OCaml's GC, including the 'GC does more work when you allocate more' behaviour,
although it has the benefit that threads that don't allocate don't have to run the GC, which doesn't apply to OCaml (I don't know about Multicore OCaml):
<a href="https://www.aicas.com/cms/sites/default/files/Garbage.pdf" rel="nofollow">https://www.aicas.com/cms/sites/default/files/Garbage.pdf</a>
<a href="http://www.aicas.com/papers/jtres07_siebert.pdf" rel="nofollow">http://www.aicas.com/papers/jtres07_siebert.pdf</a>
It seems they provide a tool (static analysis?) to determine the worst-case allocation time for an application.<p>* FijiVM's approach: <a href="http://janvitek.github.io/pubs/pldi10b.pdf" rel="nofollow">http://janvitek.github.io/pubs/pldi10b.pdf</a><p>* Azul's approach to eliminating pauses during GC and compaction, although with a quite invasive read barrier:
<a href="http://www.azulsystems.com/sites/default/files//images/wp_pgc_zing_v5.pdf" rel="nofollow">http://www.azulsystems.com/sites/default/files//images/wp_pg...</a>
<a href="http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf" rel="nofollow">http://www.azulsystems.com/sites/default/files/images/c4_pap...</a><p>Once you have a GC with predictable behaviour you also need a standard library with predictable behaviour (which you have with Core_kernel, right?), and Javolution has an interesting approach, they annotate the public API of the library with time constraints and the compiler can give some warnings:
<a href="http://javolution.org/apidocs/javolution/lang/Realtime.html" rel="nofollow">http://javolution.org/apidocs/javolution/lang/Realtime.html</a>
<a href="http://javolution.org/apidocs/javolution/lang/Realtime.Limit.html" rel="nofollow">http://javolution.org/apidocs/javolution/lang/Realtime.Limit...</a>
<a href="http://javolution.org/" rel="nofollow">http://javolution.org/</a>
I don't know how this works in practice; it'll probably just warn if you try to use a LINEAR-time function where you claim your function should be CONSTANT time, or if you put such calls inside loops.
Not sure if it'd be worth trying out this approach for Core_kernel, but with OCaml annotations and ppx it might be possible to automatically infer and check (a rough sketch of what such annotations could look like follows the list):<p>* the stack depth of a function (none, tail-recursive, linear-recursive, tree-recursive)<p>* worst-case time complexity (guaranteed const (no loops, or just constant-length ones, no recursion, no allocation); const (like guaranteed but with allocations); log (manual annotation); linear (recursion on a decreasing list, or a loop); quadratic (one nested loop); unknown)<p>* I/O blocking behaviour (although you do this by hiding the blocking functions in Async.Std already)
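As a rough sketch, assuming hypothetical [@@complexity] and [@@stack] attributes: the attribute names and the checker are made up here, and nothing like this exists in Core_kernel today, but the attribute syntax itself is ordinary OCaml that a ppx could inspect.<p><pre><code> (* Constant time, no allocation, no recursion: trivially checkable. *)
 let is_empty l = match l with [] -> true | _ :: _ -> false
 [@@complexity "guaranteed_const"] [@@stack "none"]

 (* Linear time, tail-recursive: recursion on a structurally smaller list. *)
 let rec count acc l = match l with [] -> acc | _ :: tl -> count (acc + 1) tl
 [@@complexity "linear"] [@@stack "tail_recursive"]

 (* Linear time but also linear stack depth: not tail-recursive, so a
    checker should reject a "tail_recursive" claim here. *)
 let rec swap_pairs = function
   | [] -> []
   | (a, b) :: tl -> (b, a) :: swap_pairs tl
 [@@complexity "linear"] [@@stack "linear_recursive"]
</code></pre>
A ppx could read these attributes and mechanically check at least the easy cases (no allocation or recursion inside "guaranteed_const" bodies, calls in tail position for "tail_recursive") and warn on anything it can't prove.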