Was going to make a pun on the title "... because uninterruptible sleep is a bitch", but it doesn't talk about that.<p>Going back to the topic, there are great points there. I remember discovering "tc qdisc" and playing with it. Really nice tool.<p>But another thing to learn, perhaps, is to try to avoid the gray zone by going to either the
"black zone" = dead, or "white zone" = working fine. That is, if a node/process/VM/disk start showing signs of failure above a threshold, something else should kill/disable it or restart it.<p>Think of it as trying to go to stable known states. "Machine is up, running, serving data, etc", "Machine is taken offline". If you can try to avoid in-between "gray states" -- "Some processes are working, some are not", "swap is full and running out of memory, oomkiller is going to town, some some services kinda work" and so on. There are just too many degrees of freedom and it is hard to test against all of them. Obviously somethings like network issues cannot be fixed with a simple restart so those have to be tested.
If you'd like to simulate network crappiness on OS X, you can use the Network Link Conditioner from Apple themselves: <a href="http://nshipster.com/network-link-conditioner/" rel="nofollow">http://nshipster.com/network-link-conditioner/</a><p>I was very impressed with its feature-set (for what it is). On our team, we use it to see how our iOS app will react to severe network problems (via testing in the simulator, mostly, though it's also available on iOS devices themselves as explained in the above article).
This is the "I don't know how my network works, so let's throw a wrench into the works and see what happens, fix it, rinse, repeat" form of network and systems engineering. It's certainly useful at various points in tuning performance, but it doesn't replace actually designing your system to resist these problems to begin with.<p>Even if you introduce these network performance issues, the results are meaningless if you don't have instrumentation ready to capture metrics on the results throughout the network/systems. Everyone wants to write about what happened when they partitioned their network. But you notice how nobody writes about the netflows, the taps, the service monitors, the interface stats, the app performance stats, the query run times, host connection state stats, miscellaneous network error stats, transaction benchmark stats, and hundreds of other data sources that are required to analyze the resulting network congestion.<p>To me it's much more vital that I can correlate events to track down an issue in real-time. You will never be able to identify all possible failure types by making random things fail, but you can improve the process by which you identify a random problem and fix it quickly.
kill -9, no more CPU time<p><a href="https://m.youtube.com/watch?v=Fow7iUaKrq4" rel="nofollow">https://m.youtube.com/watch?v=Fow7iUaKrq4</a>
You have to be careful using iptables DROP rules on the OUTPUT chain, as this manifests itself (at least on our systems) as failed send socket calls (which are often retried by the application), rather than true packet loss. Netem tends to work as expected.
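To make the distinction concrete, a rough Python sketch of what the application sees (the destination address and retry policy are made up): with a DROP rule on OUTPUT the send call itself can fail with an OSError and the app's retry logic kicks in, whereas under netem loss the send call succeeds and the datagram simply disappears on the wire.

    import socket
    import time

    DEST = ("10.0.0.5", 9000)   # hypothetical destination
    MAX_RETRIES = 5

    def send_with_retry(payload: bytes) -> bool:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for attempt in range(MAX_RETRIES):
            try:
                # Under an iptables DROP rule on OUTPUT, this call can fail
                # immediately with an OSError, so the application "sees" the
                # problem and retries -- not the same as real packet loss.
                sock.sendto(payload, DEST)
                return True
            except OSError as exc:
                print(f"send failed ({exc}), retry {attempt + 1}")
                time.sleep(0.1 * (attempt + 1))
        return False

    # Under netem loss, sendto() succeeds every time and the datagram is
    # silently dropped later, which is usually the behaviour you want to test.
    send_with_retry(b"probe")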
This focuses mostly on simulating unreliable networking. Is there a tool, perhaps some LD_PRELOAD wrapper, that can simulate unreliable everything? I'm talking memory errors, disks going away, fake high I/O load, etc?<p>I once wrote a library for Python that injected itself into the main modules (os, sys, etc) and generated random failures all over the place. It worked very well for writing reliable applications, but it only worked for pure Python code. I don't own the code, so I can't open source it unfortunately.
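In the same spirit -- not that library, just a minimal sketch of the idea -- you can monkey-patch a handful of stdlib functions so that I/O fails at random, which is surprisingly effective at flushing out missing error handling in pure-Python code (the failure rate and the patched functions are arbitrary choices):

    import os
    import random

    FAILURE_RATE = 0.05   # hypothetical: fail roughly 5% of calls

    def _inject(module, name, exc):
        """Replace module.name with a wrapper that randomly raises exc."""
        original = getattr(module, name)

        def wrapper(*args, **kwargs):
            if random.random() < FAILURE_RATE:
                raise exc
            return original(*args, **kwargs)

        setattr(module, name, wrapper)

    # Make a handful of os-level calls flaky; extend the list as needed.
    _inject(os, "write", OSError("injected: disk went away"))
    _inject(os, "read", OSError("injected: I/O error"))
    _inject(os, "listdir", OSError("injected: transient failure"))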
I recognise those commands ...<p><a href="http://stackoverflow.com/questions/614795/simulate-delayed-and-dropped-packets-on-linux" rel="nofollow">http://stackoverflow.com/questions/614795/simulate-delayed-a...</a><p>I am still trying to work out how to avoid knobbling my DB connection when trying to simulate client errors on a single dev machine.
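One possible workaround (just a sketch; the ports, delays, and drop rate below are made up): instead of shaping the whole interface, point only the client traffic at a small misbehaving TCP proxy and leave the DB connection untouched.

    import asyncio
    import random

    LISTEN_PORT = 9090               # hypothetical: point the client here
    UPSTREAM = ("127.0.0.1", 8080)   # hypothetical: the real service
    DELAY_RANGE = (0.1, 0.5)         # seconds of added latency per chunk
    DROP_RATE = 0.02                 # fraction of chunks that kill the connection

    async def pump(reader, writer):
        """Copy data one way, adding latency and occasional abrupt failures."""
        while data := await reader.read(4096):
            await asyncio.sleep(random.uniform(*DELAY_RANGE))
            if random.random() < DROP_RATE:
                writer.close()        # simulate an abrupt connection failure
                return
            writer.write(data)
            await writer.drain()
        writer.close()

    async def handle(client_r, client_w):
        upstream_r, upstream_w = await asyncio.open_connection(*UPSTREAM)
        await asyncio.gather(
            pump(client_r, upstream_w),
            pump(upstream_r, client_w),
            return_exceptions=True,
        )

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", LISTEN_PORT)
        async with server:
            await server.serve_forever()

    asyncio.run(main())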
Brings back horrible memories of writing tc scripts to simulate VSAT and rural DSL back in the bad old days. We bundled them up on a Soekris box and called it the "DSLow" (as in DSL-oh) box.