12/05/2023

Have you ever tried to optimize a system but found it just would not get
any faster than some seemingly arbitrary point? Did it seem like the
system had somehow agreed that it would never deliver results in less
than X milliseconds, even when it was unloaded and had a super-quick
network link between the devices?
This happened to some friends of mine a couple of years ago. They had
been running one version of some software for a long time, and it had
been forked off from upstream. It apparently had picked up a bunch of
local fixes for efficiency, correctness, and all of that good stuff.
Still, it had managed to miss out on a bunch of goodness, and so the
company eventually moved back to the open-source release.
Upon doing that, they noticed that no requests would complete in less
than 40 milliseconds, even though they had done so under the same
conditions on the older version of the code. This magic number kept
showing up: 40 ms here, 40 ms there. No matter what they did, it would
not go away.
I wish I had been there to find out what finally got them to turn the
corner to the solution. Alas, that detail is missing. But, we do know
what they discovered: the upstream (open source) release had forgotten
to deal with the Nagle algorithm.
Yep. Have you ever looked at TCP code and noticed a couple of calls to
setsockopt(), one of them setting TCP_NODELAY? That's why. When that
algorithm is enabled on Linux, TCP tries to collapse a bunch of tiny
sends into fewer, bigger ones so it doesn't blow a lot of network
bandwidth on per-packet overhead. Unfortunately, actually gathering
things up involves a certain amount of delay and a timeout before
smaller quantities of data get flushed to the network.
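If you have never seen what the fix looks like, it usually amounts to a
single setsockopt() call on the connected TCP socket. Here's a minimal
sketch in C (the fd and the error handling are placeholders; whatever
program you're fixing presumably already has its own socket setup):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <stdio.h>

    /* Turn off Nagle on an already-connected TCP socket (assumed fd). */
    static int disable_nagle(int fd)
    {
        int one = 1;

        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
            perror("setsockopt(TCP_NODELAY)");
            return -1;
        }
        return 0;
    }

Call something like that right after connect() or accept(), and Nagle
stops batching up the small writes on that socket.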
In this case, that timeout was 40 ms, which was significantly higher
than what they were used to seeing from their service. In the name of
keeping things running as quickly as possible, they patched it, and
things went back to their prior levels of performance.
There is an interesting artifact from this story: some of the people
involved made T-shirts showing the latency graph from their service both
before and after the fix.
Stuff like this just proves that a large part of this job is
remembering a bunch of weird data points and knowing when to match this
story to that problem.
Incidentally, if this kind of thing matters to you, the man page you
want on a Linux box is tcp(7). There are a *lot* of little knobs in
there which might affect you depending on how you are using the network.
Be careful though, and don’t start tuning things just because they
exist. Down that path also lies madness.