Every time a new generation plays with the LMAX disruptor, it's time to remind them that the modes with multiple producers/consumers can have really bad tail latency if your application's threading is not designed in the intended way.<p>The Disruptor and most other data structures that come from trading are designed to run on thread-per-core systems, meaning systems where there is no preemption during a critical section. They can get away with a lot of shenanigans in the concurrency model because of this. If you are using these data structures with a thread-per-request model, you're probably going to have a bad time.
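To make that failure mode concrete, here is a minimal sketch (illustrative C with made-up names, not the LMAX code) of why preemption in the hot path wrecks tail latency: a consumer busy-spins on a sequence the producer has not yet published, so whenever the OS deschedules the producer between claiming a slot and publishing it, every spinning consumer eats that scheduling delay.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 1024                         /* power of two */
    typedef struct { int64_t value; } slot_t;

    static slot_t ring[RING_SIZE];
    static _Atomic int64_t published = -1;         /* highest sequence visible to consumers */

    /* Producer side: the sequence is assumed to have been claimed already. */
    void produce(int64_t seq, int64_t v)
    {
        ring[seq & (RING_SIZE - 1)].value = v;
        /* If the OS preempts this thread HERE, consumers below keep spinning
           (and blowing their latency budget) until it is rescheduled. */
        atomic_store_explicit(&published, seq, memory_order_release);
    }

    /* Consumer side: busy-spin until the producer publishes `seq`. */
    int64_t consume(int64_t seq)
    {
        while (atomic_load_explicit(&published, memory_order_acquire) < seq)
            ;                                      /* assumes this thread owns its core */
        return ring[seq & (RING_SIZE - 1)].value;
    }

On a thread-per-core setup with pinned, non-preempted threads, that window never contains a scheduler pause, which is exactly the property these structures rely on.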
I am working on a C version of the disruptor ring buffer. It is very simple and I still need to verify it, so it's probably not ready for others, but it might be interesting. Aligning to 128 bytes dropped latency and stopped false sharing (see the padding sketch below).<p>I have gotten latencies of 50 nanoseconds and up.<p>disruptor-multi.c (SPMC) and disruptor-multi-producer.c (MPSC)
<a href="https://GitHub.com/samsquire/assembly">https://GitHub.com/samsquire/assembly</a><p>I am trying to work out how to support multiple producers and multiple readers (MPMC) at low latency that's what I'm literally working on today.<p>The MPSC and SPMC seem to be working at low latencies.<p>I am hoping to apply actor model to the ringbuffer for communication.<p>I'm also working on nonblocking lock free barrier. This has latency as low as 42 nanoseconds and up.
I had implemented more-or-less this same concurrency scheme for an IPS/DDoS-prevention box ~10 years ago, running on the Tilera architecture. It was fast (batching + separating read & write heads really does help a ton)... but not as fast as Tilera's built-in intercore fabric, which had some limitations but was basically a register store/load to access, with only 1 or 2 cycles of intercore latency.<p>(Aside, a generic atomic-operation pro-tip: don't, if you can help it. A load + local modify + store is always faster than an atomic modify, if you can make the memory ordering work out. And if you can't do away with an atomic modify, at least batch your updates locally to issue fewer of them.)
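A hedged sketch of the batching tip above, assuming C11 atomics (the names and flush threshold are illustrative): accumulate in a thread-local counter and issue one contended read-modify-write per batch instead of one per event. The same reasoning is why a single-writer cursor can be advanced with a plain load, add and release store rather than a lock-prefixed RMW.

    #include <stdatomic.h>
    #include <stdint.h>

    #define FLUSH_EVERY 64

    _Atomic uint64_t shared_total;                 /* contended, cross-core counter */

    void count_events(int got_event)
    {
        static _Thread_local uint64_t local = 0;   /* core-private, no contention */

        if (got_event)
            local++;
        if (local >= FLUSH_EVERY) {                /* one atomic RMW per 64 events */
            atomic_fetch_add_explicit(&shared_total, local, memory_order_relaxed);
            local = 0;
        }
    }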
Related. Others?<p><i>Disruptor: High performance alternative to bounded queues</i> - <a href="https://news.ycombinator.com/item?id=36073710">https://news.ycombinator.com/item?id=36073710</a> - May 2023 (1 comment)<p><i>LMAX Disruptor: High performance method for exchanging data between threads</i> - <a href="https://news.ycombinator.com/item?id=30778042">https://news.ycombinator.com/item?id=30778042</a> - March 2022 (1 comment)<p><i>The LMAX Architecture</i> - <a href="https://news.ycombinator.com/item?id=22369438">https://news.ycombinator.com/item?id=22369438</a> - Feb 2020 (1 comment)<p><i>You could have invented the LMAX Disruptor, if only you were limited enough</i> - <a href="https://news.ycombinator.com/item?id=17817254">https://news.ycombinator.com/item?id=17817254</a> - Aug 2018 (29 comments)<p><i>Disruptor: High performance alternative to bounded queues (2011) [pdf]</i> - <a href="https://news.ycombinator.com/item?id=12054503">https://news.ycombinator.com/item?id=12054503</a> - July 2016 (27 comments)<p><i>The LMAX Architecture (2011)</i> - <a href="https://news.ycombinator.com/item?id=9753044">https://news.ycombinator.com/item?id=9753044</a> - June 2015 (4 comments)<p><i>LMAX Disruptor: High Performance Inter-Thread Messaging Library</i> - <a href="https://news.ycombinator.com/item?id=8064846">https://news.ycombinator.com/item?id=8064846</a> - July 2014 (2 comments)<p><i>Serious high-performance and lock-free algorithms (by LMAX devs)</i> - <a href="https://news.ycombinator.com/item?id=4022977">https://news.ycombinator.com/item?id=4022977</a> - May 2012 (17 comments)<p><i>The LMAX Architecture - 100K TPS at Less than 1ms Latency</i> - <a href="https://news.ycombinator.com/item?id=3173993">https://news.ycombinator.com/item?id=3173993</a> - Oct 2011 (53 comments)
Semi-related is the Aeron project: <a href="https://github.com/real-logic/aeron">https://github.com/real-logic/aeron</a>
I've actually seen this particular library used (and misused and abused). People tried to offload I/O- and data-heavy tasks onto it and it was a spectacular failure, with multiple threads getting blocked and people having to frequently adjust its buffer size and batch size.<p>One thing to remember is that Java's I/O layering (stuff like JPA) is really terrible. People in the Java world I know tend to prefer the abstractions, while people in the trading world try to write GC-less code (unboxed primitives and byte arrays).<p>Unless you have verified that your end-to-end I/O is really fast (possibly off-heap), you're only pushing a few bytes here and there, and your latencies are all in check, this library is not for you. Do all that work first, then use this library.
LMAX - How to Do 100K TPS at Less than 1ms Latency: Video<p><a href="https://www.infoq.com/presentations/LMAX/" rel="nofollow noreferrer">https://www.infoq.com/presentations/LMAX/</a>
I built trading systems for LMAX exchanges. Their technology seems quite far from the state of the art to me.<p>I didn't know they even claimed to attempt being the fastest exchange in the world. They're very far from being so and it's quite clear that there are architectural decisions in that platform that would prevent that.
Martin Fowler has a lovely deep-dive blog post on this architecture:<p><a href="https://martinfowler.com/articles/lmax.html" rel="nofollow noreferrer">https://martinfowler.com/articles/lmax.html</a><p>It includes lots of diagrams and citations.<p>One term I always loved re: LMAX is “mechanical sympathy.” Covered in this section:<p><a href="https://martinfowler.com/articles/lmax.html#QueuesAndTheirLackOfMechanicalSympathy" rel="nofollow noreferrer">https://martinfowler.com/articles/lmax.html#QueuesAndTheirLa...</a>
I came across this a few years back when numbly watching the dependencies scroll by during some Java install.<p>“Disruptor is a fairly presumptuous name for a package,” I thought. So I looked into it. It fed musings and thought experiments for many walks to and from the T. I love the balance between simplicity and subtlety in the design.<p>If I recall correctly, it was a dependency of log4j, which makes sense for high-volume logging.
I love this pattern. There are many problems that fit it quite well once you start thinking in these terms: intentionally delaying execution over (brief amounts of) time in order to create batching opportunities that leverage the physical hardware's unique quirks.<p>Any domain with synchronous/serializable semantics can be modeled as a single writer with an MPSC queue in front of it (see the sketch below). Things like game worlds, business decision systems, database engines, etc. fit the mold fairly well.<p>The busy-spin strategy can be viewed as a downside, but I can't ignore the latency advantages. In my experiments where I have some live "analog" input like a mouse, the busy-wait strategy feels 100% transparent; I've tested it for hours without it breaking into the millisecond range (on Windows 10!). For gaming/real-time UI cases you want either this or yield. Sleep strategies are fine if you can tolerate jitter in the millisecond range.
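One way to picture the "single writer behind an MPSC queue" shape, sketched under the assumption of C11 atomics (this is illustrative, deliberately omits the slot-reuse/backpressure handling a real ring needs, and is not the Disruptor's actual protocol):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 1024                          /* power of two */

    void apply_to_world(int64_t payload);           /* placeholder for the serialized domain logic */

    typedef struct {
        int64_t payload;
        _Atomic bool ready;                         /* set by a producer, cleared by the writer */
    } slot_t;

    static slot_t ring[RING_SIZE];
    static _Atomic int64_t next_claim = 0;          /* shared claim counter for all producers */

    void publish(int64_t payload)                   /* called from many producer threads */
    {
        int64_t seq = atomic_fetch_add_explicit(&next_claim, 1, memory_order_relaxed);
        slot_t *s = &ring[seq & (RING_SIZE - 1)];
        /* A real ring must also wait for the writer to free the slot before
           reusing it; omitted here to keep the sketch short. */
        s->payload = payload;
        atomic_store_explicit(&s->ready, true, memory_order_release);
    }

    void single_writer_loop(void)                   /* exactly one thread runs this */
    {
        for (int64_t seq = 0; ; seq++) {
            slot_t *s = &ring[seq & (RING_SIZE - 1)];
            while (!atomic_load_explicit(&s->ready, memory_order_acquire))
                ;                                   /* busy-spin wait strategy */
            apply_to_world(s->payload);             /* game tick, db apply, decision step... */
            atomic_store_explicit(&s->ready, false, memory_order_relaxed);
        }
    }

Because exactly one thread applies events, the domain logic itself needs no locks; all the concurrency is pushed into the claim counter and the per-slot publish flag.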
Beware: the advertised latency figures are probably measured with the busy-spin wait strategy, which uses a lot of CPU.<p>Great library that makes processing concurrent streams incredibly easy.
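For context, the Java library lets you pick the strategy (classes such as BusySpinWaitStrategy, YieldingWaitStrategy and SleepingWaitStrategy); stripped of the surrounding machinery, the tradeoff amounts to something like the following C sketch, where the loop bodies are simplifications rather than ports:

    #include <sched.h>       /* sched_yield (POSIX) */
    #include <time.h>        /* nanosleep   (POSIX) */
    #include <stdatomic.h>
    #include <stdint.h>

    /* Spin until *published reaches seq, using one of three waiting styles. */
    void wait_for(_Atomic int64_t *published, int64_t seq, int mode)
    {
        struct timespec pause = { .tv_sec = 0, .tv_nsec = 100 };
        while (atomic_load_explicit(published, memory_order_acquire) < seq) {
            switch (mode) {
            case 0:  break;                 /* busy-spin: lowest latency, pins a core at 100% */
            case 1:  sched_yield(); break;  /* yield: near-spin latency, friendlier to co-scheduled threads */
            default: nanosleep(&pause, 0);  /* sleep: cheap on CPU, but scheduler jitter can reach milliseconds */
            }
        }
    }

Busy-spinning only pays off if the spinning thread owns its core; otherwise yield or sleep is usually the safer default.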