Was going to make a pun on the title "... because uninterruptible sleep is a bitch", but it doesn't talk about that.<p>Going back to the topic, there are great points there. I remember discovering "tc qdisc" and playing with it. Really nice tool.<p>But another thing to learn, perhaps, is to try to avoid the gray zone by going to either the
"black zone" = dead, or "white zone" = working fine. That is, if a node/process/VM/disk start showing signs of failure above a threshold, something else should kill/disable it or restart it.<p>Think of it as trying to go to stable known states. "Machine is up, running, serving data, etc", "Machine is taken offline". If you can try to avoid in-between "gray states" -- "Some processes are working, some are not", "swap is full and running out of memory, oomkiller is going to town, some some services kinda work" and so on. There are just too many degrees of freedom and it is hard to test against all of them. Obviously somethings like network issues cannot be fixed with a simple restart so those have to be tested.
If you'd like to simulate network crappiness on OS X, you can use the Network Link Conditioner from Apple themselves: <a href="http://nshipster.com/network-link-conditioner/" rel="nofollow">http://nshipster.com/network-link-conditioner/</a><p>I was very impressed with its feature-set (for what it is). On our team, we use it to see how our iOS app will react to severe network problems (via testing in the simulator, mostly, though it's also available on iOS devices themselves as explained in the above article).
This is the "I don't know how my network works, so let's throw a wrench into the works and see what happens, fix it, rinse, repeat" form of network and systems engineering. It's certainly useful at various points in tuning performance, but it doesn't replace actually designing your system to resist these problems to begin with.<p>Even if you introduce these network performance issues, the results are meaningless if you don't have instrumentation ready to capture metrics on the results throughout the network/systems. Everyone wants to write about what happened when they partitioned their network. But you notice how nobody writes about the netflows, the taps, the service monitors, the interface stats, the app performance stats, the query run times, host connection state stats, miscellaneous network error stats, transaction benchmark stats, and hundreds of other data sources that are required to analyze the resulting network congestion.<p>To me it's much more vital that I can correlate events to track down an issue in real-time. You will never be able to identify all possible failure types by making random things fail, but you can improve the process by which you identify a random problem and fix it quickly.
kill -9, no more CPU time<p><a href="https://m.youtube.com/watch?v=Fow7iUaKrq4" rel="nofollow">https://m.youtube.com/watch?v=Fow7iUaKrq4</a>
You have to be careful using iptables DROP rules on the OUTPUT chain, as this manifests itself (at least on our systems) as failed send socket calls (which are often retried by the application), rather than true packet loss. Netem tends to work as expected.
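To make the distinction concrete, a rough Python sketch of what the application sees (the destination address and retry policy are made up): with a DROP rule on OUTPUT the send call itself can fail with an OSError and the app's retry logic kicks in, whereas under netem loss the send call succeeds and the datagram simply disappears on the wire.

    import socket
    import time

    DEST = ("10.0.0.5", 9000)   # hypothetical destination
    MAX_RETRIES = 5

    def send_with_retry(payload: bytes) -> bool:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for attempt in range(MAX_RETRIES):
            try:
                # Under an iptables DROP rule on OUTPUT, this call can fail
                # immediately with an OSError, so the application "sees" the
                # problem and retries -- not the same as real packet loss.
                sock.sendto(payload, DEST)
                return True
            except OSError as exc:
                print(f"send failed ({exc}), retry {attempt + 1}")
                time.sleep(0.1 * (attempt + 1))
        return False

    # Under netem loss, sendto() succeeds every time and the datagram is
    # silently dropped later, which is usually the behaviour you want to test.
    send_with_retry(b"probe")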
This focuses mostly on simulating unreliable networking. Is there a tool, perhaps some LD_PRELOAD wrapper, that can simulate unreliable everything? I'm talking memory errors, disks going away, fake high I/O load, etc?<p>I once wrote a library for Python that injected itself into the main modules (os, sys, etc) and generated random failures all over the place. It worked very well for writing reliable applications, but it only worked for pure Python code. I don't own the code, so I can't open source it unfortunately.
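In the same spirit -- not that library, just a minimal sketch of the idea -- you can monkey-patch a handful of stdlib functions so that I/O fails at random, which is surprisingly effective at flushing out missing error handling in pure-Python code (the failure rate and the patched functions are arbitrary choices):

    import os
    import random

    FAILURE_RATE = 0.05   # hypothetical: fail roughly 5% of calls

    def _inject(module, name, exc):
        """Replace module.name with a wrapper that randomly raises exc."""
        original = getattr(module, name)

        def wrapper(*args, **kwargs):
            if random.random() < FAILURE_RATE:
                raise exc
            return original(*args, **kwargs)

        setattr(module, name, wrapper)

    # Make a handful of os-level calls flaky; extend the list as needed.
    _inject(os, "write", OSError("injected: disk went away"))
    _inject(os, "read", OSError("injected: I/O error"))
    _inject(os, "listdir", OSError("injected: transient failure"))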
I recognise those commands ...<p><a href="http://stackoverflow.com/questions/614795/simulate-delayed-and-dropped-packets-on-linux" rel="nofollow">http://stackoverflow.com/questions/614795/simulate-delayed-a...</a><p>I am still trying to work out how to avoid knobbling my DB connection when trying to simulate client errors on a single dev machine.
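One possible workaround (just a sketch; the ports, delays, and drop rate below are made up): instead of shaping the whole interface, point only the client traffic at a small misbehaving TCP proxy and leave the DB connection untouched.

    import asyncio
    import random

    LISTEN_PORT = 9090               # hypothetical: point the client here
    UPSTREAM = ("127.0.0.1", 8080)   # hypothetical: the real service
    DELAY_RANGE = (0.1, 0.5)         # seconds of added latency per chunk
    DROP_RATE = 0.02                 # fraction of chunks that kill the connection

    async def pump(reader, writer):
        """Copy data one way, adding latency and occasional abrupt failures."""
        while data := await reader.read(4096):
            await asyncio.sleep(random.uniform(*DELAY_RANGE))
            if random.random() < DROP_RATE:
                writer.close()        # simulate an abrupt connection failure
                return
            writer.write(data)
            await writer.drain()
        writer.close()

    async def handle(client_r, client_w):
        upstream_r, upstream_w = await asyncio.open_connection(*UPSTREAM)
        await asyncio.gather(
            pump(client_r, upstream_w),
            pump(upstream_r, client_w),
            return_exceptions=True,
        )

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", LISTEN_PORT)
        async with server:
            await server.serve_forever()

    asyncio.run(main())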
Brings back horrible memories of writing tc scripts to simulate VSAT and rural DSL back in the bad old days. We bundled them up on a Soekris box and called it the "DSLow" (as in DSL-oh) box.