Software resilience: The 500 test

77 points by ifcologne, about 7 years ago

8 comments

Animats, about 7 years ago
I've been struggling with transactional consistency across the network in, of all things, Second Life. Yes, Second Life, the virtual world.

The classical web is mostly stateless, although that's changing with fancier web sites. Second Life has always been very much a persistent state system. Technically, this makes it an alternate universe to the classical web. So it hit these problems long ago.

Second Life's world is divided into regions 256 meters on a side. The viewer displays a seamless world, but internally, the seams are very real. Each region is maintained by a separate process, a "sim", loosely coupled to neighboring sims and the viewer.

Avatars and vehicles can cross region boundaries. Often badly. For over a dozen years, region crossing behavior in SL has been fragile. Objects sink through the ground or fly off into space at region crossings. Vehicles and avatars become separated. Avatars with elaborate clothing and attachments can even be damaged so badly that the user has to do extensive repair work. The Second Life community, and the developers of Second Life, were convinced this was un-fixable.

I became interested in this problem as an SL user and started to work on it. The viewer is open source, and the message formats are known. Within the Second Life world, objects are scriptable. So, even without access to the internals of the server, much can be done from the outside.

My goal was to make fast vehicles cross regions properly in Second Life. The first step was to fix some problems in the viewer. The viewer tries to hide the delay at a region crossing handoff with extrapolation from the last position and velocity. The extrapolation amplifies noise from the physics simulator, and can result in errors so bad that vehicles appear to roll over, fly into the air, or sink into the ground. I managed to limit extrapolation enough to restore movement sanity.

Once that was under control, it was clear there were several different remaining problems. They now looked different visually, and could be attacked separately. There were several race conditions. I couldn't fix them completely from the outside, but I was able to detect them and prevent most of them from the scripting language code which controls vehicles. This got the mean time between failures up to about 30-40 minutes for a user driving a fast vehicle. When I started, that number was around 5-10.

The remaining problems were intermittent. I discovered that overloading the network connection to the viewer tended to induce this failure. So I used network test features in Linux to introduce errors, delays, and packet reordering. It turned out that adding 1 second of network delay, with no errors or reordering, would consistently break region crossings. This provided, for the first time, a repeatable test case for the bug.

I couldn't fix it, but I could provide a repeatable test case to the vendor, Linden Labs. With some publicity, upper management was made aware of the bug, and effort is now being applied to solving it. It turns out that the network retransmission code has problems. (Second Life uses its own UDP-based protocol.) Fixing that may help, but it's not clear that it's the entire problem.

The underlying problem is that, during a region crossing, both region manager programs ("sims") and the viewer all have state machines which must make certain state transitions in a coordinated way. The error cases do not seem to have been thoroughly worked out. It's possible to get stuck out of sync. Second Life is supposed to be a consistent-eventually system, but that wasn't achieved.

This is roughly the same problem as the "500 test" in the parent article. If you have communication problems, both ends must automatically resolve to a valid consistent state when communication resumes. Distributed database systems have to do this. It's not easy.

Network-connected state machines are a pain to analyze. If the number of states at each end is not too large, examining all possible combinations by hand is feasible. That's what the 2 volumes of "TCP/IP Illustrated" do for TCP. If you create your own network-connected state machines, you face a similar task. If you don't do it, your system will break.
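The Linux network-test trick described above is reproducible with the stock tc/netem queueing discipline. A minimal sketch, assuming a test machine whose traffic to the client goes out through eth0 (the interface name, delay value, and subprocess wrapper are illustrative, not from the comment; tc requires root):

```python
import subprocess

IFACE = "eth0"  # hypothetical test interface; adjust for your machine

def run(cmd):
    """Run a command, raising if it fails."""
    subprocess.run(cmd, check=True)

def add_delay(ms):
    # Attach a netem qdisc that delays every outgoing packet by `ms`
    # milliseconds -- no loss, no reordering, just added latency.
    run(["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{ms}ms"])

def clear():
    # Remove the netem qdisc, restoring normal traffic.
    run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"])

if __name__ == "__main__":
    add_delay(1000)  # the 1-second delay that reliably broke region crossings
    try:
        input("Delay active -- reproduce the failure, then press Enter...")
    finally:
        clear()
```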
saltcured, about 7 years ago
We have faced this in a perverse form with some of our Python web apps behind Apache HTTPD and mod_wsgi. Apache itself will generate a 500 sporadically due to its internal TLS or HTTP/2 proxy code tripping over itself. If we capture one of these responses, it has the default Apache error page HTML structure, instead of the error page our own web framework would generate for a 500 if an uncaught exception leaked out.

This 500 response to the client can occur before or after our service has been invoked, and our service logic doesn't know anything went wrong. Our service might even log successful processing, meaning that it made its own ACID commit and started to generate a success response. For very small responses like "204 No Content", we might never know there was a problem. For larger responses, the WSGI layer may produce errors when we cannot generate a whole response before the connection is lost.

In our AJAX apps in front of this service, we have had to resort to treating 500 the same as a lost response, doing backoff and retry without assuming we know whether the first request succeeded or failed.
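A sketch of that client posture, in Python rather than the browser-side AJAX the comment describes: treat a 500 exactly like a dropped connection and retry with exponential backoff. This is only safe if requests are deduplicated server-side; the Idempotency-Key header here is an assumption, not something the comment specifies:

```python
import time
import uuid

import requests

def post_with_retry(url, payload, attempts=5):
    # One key per logical operation, reused across retries, so the server
    # can deduplicate if the first request actually succeeded.
    idempotency_key = str(uuid.uuid4())
    delay = 0.5
    for _ in range(attempts):
        try:
            resp = requests.post(
                url, json=payload, timeout=10,
                headers={"Idempotency-Key": idempotency_key},
            )
            # A 500 tells us nothing about whether the work happened,
            # so treat it like a lost response and retry.
            if resp.status_code != 500:
                return resp
        except requests.RequestException:
            pass  # connection reset, timeout, etc. -- same treatment
        time.sleep(delay)
        delay *= 2  # exponential backoff
    raise RuntimeError(f"gave up after {attempts} attempts")
```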
stephengillie, about 7 years ago
From my time supporting a real estate website host - I call it 'debugging by error' - the error code tells you which part of the stack is broken.

400 - URL/client side.
401 - Permissions. If you have IIS, maybe the web agent service account password is expired.
402 - Pay your bill.
403 - Permissions, or a deeper error.
404 - Missing file.
500 - Code error.
502 - Load balancer has no good hosts to service the request. Or, if you have IIS, this is the Windows kernel saying no worker processes are working.
503 - Database overloaded or down.
504 - Server or load balancer not responding.
vvanders, about 7 years ago
Lots of wisdom here, but much of this is easier said than done.

That said, a lot of the Erlang ethos covers this (things *will* fail at scale, so do something reasonable).
mattsfrey, about 7 years ago
My point may seem facetious, but wouldn't ensuring (through proper catches and guards) that a 500 doesn't happen at all inside your own systems be the best resiliency? And when you experience one, it's a red alert and you immediately harden whatever component failed? I guess that's just the school of thought I've operated under.
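One concrete reading of "proper catches and guards" is a last-ditch handler at the top of the stack that turns any uncaught exception into a controlled response and fires an alert. A minimal WSGI-middleware sketch; the alert_oncall hook is a hypothetical stand-in for whatever pager or error tracker you use:

```python
import logging
import sys
import traceback

log = logging.getLogger("red-alert")

def alert_oncall(exc):
    # Hypothetical hook: a real system would page someone or push to an
    # error tracker, since an escaped 500 is treated as a red-alert event.
    log.critical("uncaught exception: %s", traceback.format_exc())

class CatchAllMiddleware:
    """Wraps a WSGI app so no exception escapes as an uncontrolled 500."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            return self.app(environ, start_response)
        except Exception as exc:
            alert_oncall(exc)
            # Pass exc_info, as PEP 3333 requires when start_response
            # may already have been called by the wrapped app.
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain")],
                           sys.exc_info())
            return [b"internal error; operators have been notified"]
```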
sleavey, about 7 years ago
ACIDity is an annoyingly missing feature in WordPress. For software that's used on something like 20% of all websites, it doesn't use the InnoDB MySQL table format by default, and therefore doesn't support MySQL transactions out of the box.
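For context, the MySQL side of that complaint: MyISAM tables silently auto-commit each statement, while InnoDB honors transaction boundaries. A sketch using mysql-connector-python; the credentials are placeholders, and wp_options is the standard WordPress table name:

```python
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="wp", password="secret", database="wordpress",
)
cur = conn.cursor()

# Convert a table to InnoDB so it participates in transactions.
# (Against a MyISAM table, the transaction below would be meaningless:
# each statement would commit on its own.)
cur.execute("ALTER TABLE wp_options ENGINE=InnoDB")

try:
    conn.start_transaction()
    cur.execute(
        "UPDATE wp_options SET option_value = %s WHERE option_name = %s",
        ("1", "comment_registration"),
    )
    # ... more statements that must succeed or fail together ...
    conn.commit()
except mysql.connector.Error:
    conn.rollback()  # ACID: all-or-nothing
    raise
finally:
    conn.close()
```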
amelius, about 7 years ago
> In the event of any internal fault (500), your service should be left in a state that's consistent, ...

What if the software enters a non-terminating loop?
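One standard answer to the non-terminating-loop case is a watchdog: bound every request with a deadline, and treat blowing the deadline like any other internal fault. A Unix-only sketch using SIGALRM; the timeout value and helper names are illustrative:

```python
import signal
from contextlib import contextmanager

class DeadlineExceeded(Exception):
    pass

@contextmanager
def deadline(seconds):
    # SIGALRM fires if the block runs too long, turning an endless
    # loop into an ordinary exception the outer 500 handler can see.
    def _raise(signum, frame):
        raise DeadlineExceeded(f"exceeded {seconds}s deadline")

    old = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)

# Usage: an infinite loop becomes a catchable fault.
try:
    with deadline(5):
        while True:  # simulated non-terminating loop
            pass
except DeadlineExceeded:
    print("watchdog fired; roll back and return a consistent error")
```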
baconomatic, about 7 years ago
This is interesting. Have you implemented this in services before? How have you handled continued failure for specific requests?