Disclaimer: I work on Amazon Route 53 and Elastic Load Balancer.<p>From the article ... "Unfortunately there are a large number of (misbehaving) DNS servers out there that don’t properly obey TTLs on records and will still serve up stale records for an indefinite amount of time."<p>I would push back on how large this number is in general; whenever we experiment with DNS weights, we see about 98% of browser clients honouring the change, and 99% within 5 minutes. But with mobile networks and Java clients, things can be different. Mobile networks commonly have very few resolvers, so there are only a few answers in the mix to distribute load across, and some versions of Java cache answers forever by default.<p>Here's the hack we use to help in these situations when you control the client; it's something we've worked on with some mobile app authors. With Route 53 (and hopefully Dyn too, I'm not sure), you can configure a wildcard name to be a series of weighted entries, backed by health checks. So instead of:<p><pre><code> ping.chartbeat.net weight=1 answer=192.0.2.1 healthcheck=111
ping.chartbeat.net weight=1 answer=192.0.2.2 healthcheck=222
</code></pre>
it can be configured as:<p><pre><code> *.ping.chartbeat.net weight=1 answer=192.0.2.1 healthcheck=111
*.ping.chartbeat.net weight=1 answer=192.0.2.2 healthcheck=222
</code></pre>
so pretty much the same, but then you have the client look up:<p><pre><code> [ some random nonce / guid ].ping.chartbeat.net
</code></pre>
and voilà - you have busted any intermediate cache, and load is also spread more evenly (there are usually many more clients than DNS resolvers).<p>Self-promotion: If you do choose to use ELB and Route 53, we also support wildcard ALIASes to ELBs, and the queries are handled free of charge.
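<p>For completeness, here's a rough sketch of the client-side lookup in plain C (illustrative only and untested, reusing the example name above; a real mobile app would use whatever resolver API its platform already provides):<p><pre><code> #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
 #include <unistd.h>
 #include <sys/socket.h>
 #include <netdb.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>

 int main(void) {
     char host[256];
     srand((unsigned)time(NULL) ^ (unsigned)getpid());
     /* Random nonce label busts any intermediate resolver cache. */
     snprintf(host, sizeof(host), "%08x%08x.ping.chartbeat.net",
              (unsigned)rand(), (unsigned)rand());

     struct addrinfo hints, *res, *p;
     memset(&hints, 0, sizeof(hints));
     hints.ai_family = AF_INET;
     hints.ai_socktype = SOCK_STREAM;

     int err = getaddrinfo(host, NULL, &hints, &res);
     if (err != 0) {
         fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(err));
         return 1;
     }
     for (p = res; p != NULL; p = p->ai_next) {
         char ip[INET_ADDRSTRLEN];
         struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
         inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip));
         printf("%s -> %s\n", host, ip);   /* one of the weighted answers */
     }
     freeaddrinfo(res);
     return 0;
 }
</code></pre>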
Instead of netstat(8) or ss(8), check out /proc/net/sockstat, /proc/net/netstat and /proc/net/tcp (a quick parsing sketch follows the quoted settings below). Might as well save a fork and some context switches.<p><pre><code> net.ipv4.tcp_rmem=8192 873800 8388608
net.ipv4.tcp_wmem=4096 655360 8388608
net.ipv4.tcp_mem=8388608 8388608 8388608
</code></pre>
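<p>All of the interesting counters are sitting in /proc/net/sockstat; here's a rough sketch of pulling them without forking netstat/ss (untested plain C; field layout assumed to be the usual "TCP: inuse N orphan N tw N alloc N mem N", with mem counted in pages rather than bytes):<p><pre><code> #include <stdio.h>
 #include <string.h>

 int main(void) {
     FILE *f = fopen("/proc/net/sockstat", "r");
     if (!f) { perror("/proc/net/sockstat"); return 1; }

     char line[512];
     long inuse, orphan, tw, alloc, mem;
     while (fgets(line, sizeof(line), f)) {
         /* The TCP line carries in-use, orphan, TIME_WAIT, allocated and memory counts. */
         if (sscanf(line, "TCP: inuse %ld orphan %ld tw %ld alloc %ld mem %ld",
                    &inuse, &orphan, &tw, &alloc, &mem) == 5) {
             printf("inuse=%ld orphan=%ld timewait=%ld alloc=%ld mem_pages=%ld\n",
                    inuse, orphan, tw, alloc, mem);
         }
     }
     fclose(f);
     return 0;
 }
</code></pre>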
You may want to rethink those tcp_rmem/tcp_wmem/tcp_mem values. The defaults there would support initial send and receive windows of roughly 400 & 600 packets. I've never seen initial windows that high in the wild. If it's a client you've seen recently, they should be in the peer cache already. And with that default receive allocation you only get about 39,000 sockets max (tcp_mem is counted in pages, so 8388608 pages is ~32 GiB; divide by the 873800-byte default receive buffer and you land around 39,000). Once you exceed tcp_mem high, your sockets will be force-closed with an RST sent to the other side. Much better to have 'pressure' kick in and limit the buffers, throttling the send & receive windows.<p>Go look at 'mem' in sockstat. I'd guess your average utilization is more in the 50kB range, and that includes both send and receive and the tcp_info structs, IIRC.<p><pre><code> net.ipv4.tcp_max_orphans=262144
</code></pre>
That seems incredibly high; I'd expect more in the ~5,000 range on a very busy host. Check your 'orphans' count from sockstat.<p><pre><code> net.core.netdev_max_backlog = 16384</code></pre>
From the source comments this is actually a per-CPU packet backlog, though I haven't verified the implementation.<p><pre><code> net.ipv4.tcp_max_tw_buckets=6000000</code></pre>
You may not need to do this. sysctl_max_tw_buckets limits the number of entries in the TIME_WAIT queue. When a socket moves to TIME_WAIT and the list is full, it will instead go directly to CLOSE. Not very polite, and it's <i>possible</i> you fail to retransmit data, but IMHO a low-risk scenario. See what level you're actually running at in sockstat.<p><pre><code> tcp_tw_recycle</code></pre>
The worst sysctl name ever. The useful part is setting the TIME_WAIT timer to the socket's RTO instead of TCP_TIMEWAIT_LEN (60 seconds). The terrible behavior is in tcp_v4_conn_request() of tcp_ipv4.c: the sysctl also enables strict timestamp & sequence checking on SYNs. If peers behind a NAT device have clocks > 1 second apart, their SYNs will be silently dropped. IIRC PawsPassive from /proc/net/netstat will be incremented for each drop (quick sketch for checking it below).<p><pre><code> tcp_tw_reuse</code></pre>
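<p>To check that counter, something like this works (untested; the field name "PAWSPassive" is from memory and may differ by kernel version):<p><pre><code> #include <stdio.h>
 #include <string.h>

 int main(void) {
     FILE *f = fopen("/proc/net/netstat", "r");
     if (!f) { perror("/proc/net/netstat"); return 1; }

     /* The file comes in pairs: a "TcpExt:" line of field names, then a "TcpExt:" line of values. */
     char names[4096], values[4096];
     while (fgets(names, sizeof(names), f)) {
         if (strncmp(names, "TcpExt:", 7) != 0) continue;
         if (!fgets(values, sizeof(values), f)) break;     /* matching value line */

         int col = -1, i = 0;
         for (char *tok = strtok(names, " \n"); tok; tok = strtok(NULL, " \n"), i++)
             if (strcmp(tok, "PAWSPassive") == 0) col = i;  /* field name assumed, see above */

         i = 0;
         for (char *tok = strtok(values, " \n"); tok; tok = strtok(NULL, " \n"), i++)
             if (i == col) printf("PAWSPassive = %s\n", tok);
         break;
     }
     fclose(f);
     return 0;
 }
</code></pre>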
For tcp_tw_reuse: see tcp_twsk_unique() in tcp_ipv4.c. IIRC when you request a new ephemeral socket, it's checked against the timewait socket list. If sysctl_tcp_tw_reuse is set and the TIME_WAIT socket is older than one second, it can be reused. Normally TIME_WAIT sockets are aged out of the queue after TCP_TIMEWAIT_LEN, ~60 seconds.<p>On TIME_WAIT in general, you should probably look into setting the Maximum Segment Lifetime to a more reasonable value than 60s. You want to cover your max client RTO plus one or two retransmissions. IMO something like 10s may be too short, but I cannot imagine 30s not working splendidly. See TCP_TIMEWAIT_LEN & TCP_PAWS_MSL & wherever else I'm missing the header values.
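<p>For reference, here's roughly how those ship in mainline (include/net/tcp.h, quoted from memory). There is no sysctl for them, so dropping TIME_WAIT to ~30s means patching the header and rebuilding the kernel:<p><pre><code> /* As shipped (60s); the suggestion above would be e.g. (30*HZ). */
 #define TCP_TIMEWAIT_LEN (60*HZ)  /* how long to wait to destroy TIME-WAIT state */
 #define TCP_PAWS_MSL     60       /* per-host timestamps invalidated after this many idle seconds */
</code></pre>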