I've been struggling with transactional consistency across the network in, of all things, Second Life. Yes, Second Life, the virtual world.<p>The classical web is mostly stateless, although that's changing with fancier web sites. Second Life has always been very much a persistent state system. Technically, this makes it an alternate universe to the classical web. So it hit these problems long ago.<p>Second Life's world is divided into regions 256 meters on a side. The viewer displays a seamless world, but internally, the seams are very real. Each region is maintained by a separate process, a "sim", loosely coupled to neighboring sims and the viewer.<p>Avatars and vehicles can cross region boundaries. Often badly. For over a dozen years, region crossing behavior in SL has been fragile. Objects sink through the ground or fly off into space at region crossings. Vehicles and avatars become separated. Avatars with elaborate clothing and attachments can even be damaged so badly that the user has to do extensive repair work. The Second Life community, and the developers of Second Life, were convinced this was un-fixable.<p>I became interested in this problem as a SL user and started to work on it. The viewer is open source, and the message formats are known. Within the Second Life world, objects are scriptable. So, even without access to the internals of the server, much can be done from the outside.<p>My goal was to make fast vehicles cross regions properly in Second Life. The first step was to fix some problems in the viewer. The viewer tries to hide the delay at a region crossing handoff with extrapolation from the last position and velocity. The extrapolation amplifies noise from the physics simulator, and can result in errors so bad that vehicles appear to roll over, fly into the air, or sink into the ground. I managed to limit extrapolation enough to restore movement sanity.<p>Once that was under control, it was clear there were several different remaining problems.
They now looked different visually, and could be attacked separately. There were several race conditions. I couldn't fix them completely from the outside, but I was able to detect them and prevent most of them from the scripting language code which controls vehicles. This got the mean time between failures up to about 30-40 minutes for a user driving a fast vehicle. When I started, that number was around 5-10.<p>The remaining problems were intermittent. I discovered that overloading the network connection to the viewer tended to induce this failure. So I used network test features in Linux to introduce errors, delays, and packet reordering. It turned out that adding 1 second of network delay, with no errors or reordering, would consistently break region crossings. This provided, for the first time, a repeatable test case for the bug.<p>I couldn't fix it, but I could provide a repeatable test case to the vendor, Linden Labs. With some publicity, upper management was made aware of the bug, and effort is now being applied to solving it. It turns out that the network retransmission code has problems. (Second Life uses its own UDP-based protocol.) Fixing that may help, but it's not clear that it's the entire problem.<p>The underlying problem is that, during a region crossing, both region manager programs ("sims") and the viewer all have state machines which must make certain state transitions in a coordinated way. The error cases do not seem to have been thoroughly worked out. It's possible to get stuck out of sync. Second Life is supposed to be a consistent-eventually system, but that wasn't achieved.<p>This is roughly the same problem as the "500 test" in the parent article. If you have communication problems, both ends must automatically resolved to a valid consistent state when communication resumes. Distributed database systems have to do this. It's not easy.<p>Network-connected state machines are a pain to analyze. If the number of states at each end is not too large, examining all possible combinations by hand is feasible. That's what the 2 volumes of "TCP/IP Illustrated" do for TCP. If you create your own network connected state machines, you face a similar task. If you don't do it, your system will break.