Bottlenecking on logging is a common problem. Classically, Linux assumed it had a serial console device, and support can still be enabled at kernel compile time.[1] Apparently, this mass of AWS instances, VMs, and Docker images works that way.<p>I liked the QNX approach, where the kernel sends log messages to another process, loaded on boot. When you build a boot image, you provide a logger process to read those messages. In embedded, there's no expectation that a console is available. If your system is a pump or an auto dashboard, you need to send the messages somewhere other than a "console". There might not even be a file system. In QNX, messaging is a primitive on top of which file systems and networking are built.<p>When I had to do logging in real time, each process used a library that made "logprintf" calls. Those went into a circular buffer which was drained to a log file by another thread. If the circular buffer filled, "..." would appear in the log and log messages would be lost, but the real-time thread would never block.<p>Interaction between real-time and non-real-time is always tough. It comes up a lot in networked game development.<p>[1] <a href="https://www.kernel.org/doc/html/latest/admin-guide/serial-console.html" rel="nofollow">https://www.kernel.org/doc/html/latest/admin-guide/serial-co...</a>
> There are a few processes in the bot that are especially latency sensitive, which we have tuned the nice value for.<p>This immediately signals some level of "does not understand how this stuff works". Latency-sensitive (audio or other) tasks need to be in the SCHED_RR or SCHED_FIFO scheduling class, on which nice(1) has no effect.<p>Conversely, using nice(1) on a SCHED_OTHER task is also unlikely to work, given that nice only weights ordinary scheduling decisions and cannot provide RT-like behavior.
I am surprised. All this complex discovery, and the syslog contained the culprit line from the very first moment:<p><pre><code> serial8250: too much work for irq4
</code></pre>
Do people not look at syslog anymore? It's one of the first things I do on unexplainable problems: check for OOMs, thermal throttling, kernel BUGs, etc. Sure, this isn't the most common problem, but the check is fast and easy.
I'm curious why this is even an issue. I don't understand why an actual interrupt would get tripped for virtual serial port writes, and I don't understand why a virtual serial port (i.e. logging) gets swamped by what seems like a moderate stream of data.<p>The first result for "8250 too much work for irq4" is this: <a href="https://unix.stackexchange.com/questions/387600/understanding-serial8250-too-much-work-for-irq4-kernel-message" rel="nofollow">https://unix.stackexchange.com/questions/387600/understandin...</a> (2017)<p><i>The problem is that the UART hardware that is emulated by various brands of virtual machine behaves impossibly, sending characters at an impossibly fast line speed. To the kernel, this is indistinguishable from faulty real UART hardware that is continually raising an interrupt for an empty output buffer/full input buffer. (Such faulty real hardwares exist, and you will find embedded Linux people also discussing this problem here and there.) The kernel pushes the data out/pulls the data in, and the UART is immediately raising an interrupt saying that it is ready for more.</i><p>So, is this seven-year-old problem still a problem, or is the virtual serial port driver actually unable to keep up with the stream of text?
The more we abstract things, treat servers like cattle, and lose low-level knowledge, the more things like this will happen.<p>You shouldn’t have to try to reproduce this in a test environment: your infrastructure should allow profiling in production for cases like this. And it should be solved with profiling, not guesswork and bisecting.
What I don't understand is why the upstream audio doesn't just buffer while the downstream thing processing it is blocked. Why should that result in audible artifacts? Can't it just catch up with the rest of the buffer later?<p>Buffer overruns feel very 1996-cd-burner-ish. Ope, burned a coaster, let's try this hellaciously real-time-bound thing again with inadequate buffering and I/O devices that have unpredictable latency.<p>What am I missing?
AIUI the RT patchset (now mainlined) has specific code changes to solve this issue.<p>If you're running PREEMPT_RT, which anything handling realtime audio really should be, this should be handled.
At Recall.ai, we built a 10,000-node cluster processing over 1TB/sec of raw video in real time.<p>After a major migration, we faced a strange audio issue that led us on a deep dive through our infrastructure.<p>The culprit? Not the audio code, but a hidden interaction with AWS’s virtual serial ports.<p>We wrote about our journey discovering the artifacts and finding a clean fix!