Pretty fun puzzle, but this kind of debugging is alien to me. Without going into too much detail, it's pretty easy to see something suspect is going on, and this hunch can be easily tested. This debugging tour takes you through strace, Wireshark, and all sorts of other low-level debugging techniques, when really all you had to do was simulate the client with curl -d and the problem would have been pretty obvious.<p>And in this case the more complicated debugging tools didn't even explain anything. As the last page says, the answer is kind of a leap. If you already knew about this problem, curl would have solved it immediately, and if you didn't, you'd still be baffled even after knowing exactly why the stack traces are the way they are.
Really nice game/tutorial.<p>The best job interview I ever had was framed like this. The interviewer told me there was a bug in the system and had a stack of pages he'd printed out that would provide successive clues as to what caused it. I could ask them questions, in effect using the interviewer as a search engine/debugger.<p>It was the closest an interview has ever come to simulating the day-to-day of a web developer.
Author here. I wrote a post about the design of this game on my blog: <a href="https://jvns.ca/blog/2021/04/16/notes-on-debugging-puzzles/" rel="nofollow">https://jvns.ca/blog/2021/04/16/notes-on-debugging-puzzles/</a>
Knowledge of delayed ACK and Nagle's algorithm wasn't necessary to solve it.<p>The explanation didn't mention the flushHeaders call, which is apparently the fix. I didn't run any tests, just looked at the JS and figured that sending 2 packets is worse than sending 1 w.r.t. latency.<p>It's also a pretty strong intuition that the client side tends to have issues, since the server is usually well-tested and standardized. Also, very often people are measuring wrong, so checking the JS to be sure the time recorded is accurate is also important.<p><a href="https://nodejs.org/api/http.html#http_request_flushheaders" rel="nofollow">https://nodejs.org/api/http.html#http_request_flushheaders</a>
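To make the "2 packets versus 1" point concrete, here is a rough Python sketch rather than the game's actual Node client. The request path, body, and helper names are made up for illustration. It only shows that writing headers and body separately delivers the same bytes as writing them together; on loopback you will most likely not see the delay itself, since that depends on Nagle's algorithm meeting delayed ACKs on a real connection.

```python
import socket


def build_request(body: bytes) -> tuple[bytes, bytes]:
    """Return (headers, body) for a small HTTP POST. Path and host are hypothetical."""
    headers = (
        b"POST /hypothetical HTTP/1.1\r\n"
        b"Host: localhost\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"\r\n"
    )
    return headers, body


def send_split(sock: socket.socket, headers: bytes, body: bytes) -> None:
    # Two writes: with Nagle enabled, the second small write may be held back
    # until the first one has been acknowledged by the peer.
    sock.sendall(headers)
    sock.sendall(body)


def send_joined(sock: socket.socket, headers: bytes, body: bytes) -> None:
    # One write: headers and body are handed to the kernel together.
    sock.sendall(headers + body)


def capture(send_fn, body: bytes = b'{"x": 1}') -> bytes:
    """Send one request over loopback TCP and return what the peer received."""
    headers, body = build_request(body)
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        with socket.create_connection(("127.0.0.1", port)) as client:
            server_side, _ = listener.accept()
            with server_side:
                send_fn(client, headers, body)
                client.shutdown(socket.SHUT_WR)  # signal end of request
                chunks = []
                while chunk := server_side.recv(4096):
                    chunks.append(chunk)
    return b"".join(chunks)
```

Both `capture(send_split)` and `capture(send_joined)` should return identical bytes; the difference is only in how many writes the kernel sees.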
As requested by the commenter below (now I want spoiler tags):<p><i>SPOILER FOR THE GAME</i><p>I’ve seen TCP_NODELAY all over the place before, but never known why. This was a fun way to find out.
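For anyone who has seen TCP_NODELAY but never set it by hand, a minimal Python sketch follows. The helper names are mine, not from the game. Setting the option disables Nagle's algorithm on that socket, so small writes are sent without waiting for earlier data to be acknowledged.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The trade-off is more, smaller packets on the wire, which is why it is usually set only on latency-sensitive connections.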
I got it straight away, but I have a telecoms engineering and networking background. This just goes to show how poorly networking is taught in most courses (as are databases), and especially in boot camps. Self-taught programmers are also very unlikely to be exposed to this kind of topic.<p>One thing that was not offered as an option was to use a packet analyzer like tcpdump or Wireshark, even though that is the most reliable and systematic way to get to the bottom of many performance problems. You'd think the popularity of the network tab in Chrome's dev tools would make this less scary.
The interesting thing for me is that I would suspect most devs including myself would assume that if the request takes 50ms, that's how long it takes (because networks!).<p>I wonder how many of us are able to judge how long something should take? Not me, except anecdotally.
I ran into an interesting variation of this where we shouldn't have had any problems with small packets, but it turned out we had jumbo frames enabled in AWS (which seems to be a default now). Together with gzip, you can actually have a bit of trouble filling up a packet, which will then be delayed by the commonly mentioned interaction with delayed ACKs.
Error: <<you-said>>: error within widget contents (Error: cannot find a closing tag for HTML <pid>)<p>This hints at a possible XSS and/or code injection (input not completely quoted). The input was "strace -s128 -f -p <pid>", as an answer to "how do you strace server process".
A comment from the inventor of Nagle’s algorithm: <a href="https://news.ycombinator.com/item?id=9050645" rel="nofollow">https://news.ycombinator.com/item?id=9050645</a><p>(tl;dr Try turning off delayed ACK first, especially if you can’t update the code.)
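If you do control the receiving side's code, turning off delayed ACK per socket on Linux looks roughly like the Python sketch below. The helper names are hypothetical. TCP_QUICKACK is Linux-specific, and per tcp(7) the flag is not permanent, so the usual pattern is to set it again after each read.

```python
import socket


def request_quickack(sock: socket.socket) -> bool:
    """Ask the kernel to ACK immediately on this socket.

    Returns False on platforms where TCP_QUICKACK is not defined.
    """
    opt = getattr(socket, "TCP_QUICKACK", None)  # only present on Linux builds
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
    return True


def recv_with_quickack(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Read once, then re-arm quick ACK because the kernel may leave that mode."""
    data = sock.recv(bufsize)
    request_quickack(sock)
    return data
```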
SPOILER ALERT<p>I answered "req.flushHeaders()" but surprisingly it doesn't accept that as a cause, even though the headers would be sent with the initial packet and should improve the latency.
"Also, the Linux kernel doesn't always enable delayed ACKs -- to reproduce this I actually had to write a Python program that explicitly turns them on. I haven't been able to find a clear explanation of exactly when delayed ACKs are used."<p>On Linux kernels 3.11+, delayed ACKs can be controlled per route with ip(8). Setting quickack to 1 turns them off for that route; setting it to 0 leaves the default delayed-ACK behavior in place.<p><pre><code> ip route change ROUTE quickack 1
</code></pre>
On macOS and Windows, delayed ACKs can be configured through sysctl and the registry, respectively.<p>Delayed ACKs may be used in response to congestion.<p>For example, in bulk, i.e., non-interactive, transfers with large packets, delayed ACKs can be useful.<p>This is covered in Chapters 15 (15.3) and 16 of Stevens' TCP/IP Illustrated Vol. 1.<p>This draft suggests delayed ACKs are useful during the TLS handshake.<p><a href="https://tools.ietf.org/id/draft-stenberg-httpbis-tcp-03.html" rel="nofollow">https://tools.ietf.org/id/draft-stenberg-httpbis-tcp-03.html</a><p>Also, socat allows for setting TCP options via setsockopt. No need to write a new program.
Twine and SugarCube! Interesting to see that pop up here:<p><a href="https://www.motoslave.net/sugarcube/2/" rel="nofollow">https://www.motoslave.net/sugarcube/2/</a>
That was exceptionally fun. I thought I had the answer but I was completely wrong. I shouldn't have stopped debugging and rushed to the solution. Unfortunately it is a game, and it allowed me to do it.<p>To me this seems pretty obscure, and you debug pretty deep into and outside of your application. Part of me thinks of this as Somebody Else's Problem, but it definitely makes me rethink whether it is an SEP or something devs should know about, especially in time-critical/real-time systems.
Based on the description, guessed 'nagle' without any debugging. Not sure if it implies I was right or wrong, but it explained Nagle's algorithm.<p>Asked what to do about it and typed 'tcpnodelay'. It replied it wasn't smart and asked me to click a button.<p>Feels pretty basic to anyone who's ever really touched TCP in code.
A few months ago there were multiple articles about this behaviour, but I really don't remember the details anymore. Does anyone know of a writeup with a detailed explanation of how it happens, and tests to see whether your systems are affected?
Loved it!
Although it didn’t give a comprehensive answer on how to preventively solve the problem on any platform.
Is there one?
Or should software developers just stick to one way or another like some kind of an unspoken rule?
Nice, but not perfect:<p>```
You said: "strace -p $(ps aux| grep server.py| grep -v "grep"| awk -F ' ' '{print $2}')".<p>To strace the server, first you need to find its PID. You know that the program is called server.py.
```
grr.. I felt stupid going through the puzzle.. but I was looking at that flushHeaders call and thought that might be a problem - simply because I never call that.
With all due respect, this is a neat advertisement for the "storytelling" JavaScript library she is using, but I learn much more by reading W. R. Stevens' books. There is more to TCP/IP than what one can do through Berkeley sockets. Plus, reading Stevens' books does not require JavaScript.