It's worth noting that the instance migration basically null-routed the redis VM for a good 30 minutes, until we manually intervened and restarted it. The instance was completely disconnected from the internal network immediately following the migration. From what we could gather from the instance logs, the routing table on the VM was completely dropped and it couldn't even reach the magic metadata service (metadata.internal - we saw "no route to host" errors for it). This is a pretty serious bug within GCP and we've already opened a case with them in the hope they can get a fix out. I think this is the 4th or 5th major bug we've encountered with their live migration system that has led, or could have led, to an outage or internal service degradation. The GCP team has seriously investigated and fixed every bug we've reported to them so far, so props to them for that! Live migration is incredibly difficult to get right.<p>We believe this triggered a bug in the redis-py python driver we use (specifically this one: <a href="https://github.com/andymccurdy/redis-py/pull/886" rel="nofollow">https://github.com/andymccurdy/redis-py/pull/886</a>), which is what forced us to rolling-restart our API cluster in the first place to get the connection pools back into a working state. redis-sentinel correctly detected the instance going away and initiated a failover almost immediately after the instance went offline, but due to the odd network situation caused by the migration (total packet loss instead of connections being reset), the client driver was unable to properly fail over to the new master. We already have work planned on our own connection pooling logic for redis-py - right now the state of the driver in HA redis setups is actually pretty awful, and the maintainer doesn't appear to have the time to look at or close PRs that address these issues (we opened one back in March that fixes a pretty serious bug during failover, <a href="https://github.com/andymccurdy/redis-py/pull/847" rel="nofollow">https://github.com/andymccurdy/redis-py/pull/847</a>, and it has yet to be addressed).
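For anyone hitting the same failure mode, here's a minimal sketch (not our actual fix; the hostnames and the "mymaster" service name are placeholders) of how redis-py's Sentinel support can be configured so that this kind of silent packet loss surfaces quickly: with explicit socket timeouts a black-holed master turns into a TimeoutError instead of a connection that hangs forever, and the sentinel-backed pool can re-resolve the master address when it reconnects.

    # Sketch only: hosts, ports, and "mymaster" are placeholders.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(
        [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
        socket_timeout=0.5,            # fail fast instead of hanging on silent packet loss
        socket_connect_timeout=0.5,
    )

    # master_for() returns a client whose pool asks sentinel for the current
    # master address when it needs to (re)connect, which is what lets a
    # failover actually take effect once the dead connection is torn down.
    master = sentinel.master_for(
        "mymaster",
        socket_timeout=0.5,
        socket_connect_timeout=0.5,
        retry_on_timeout=True,         # retry once on timeout, reconnecting first
    )

    master.set("key", "value")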
The level of detail and linearity is impressive.<p>At this scale, it seems like it may be warranted to start using reliability testing in production, in line with Netflix.<p>At the end I see mention of a library with flaws. I'm curious which library that is, given that I develop some projects in Elixir.
Ever since they launched screen sharing, I've uninstalled both Skype and Hangouts and relied entirely on it for pair programming sessions. The smoothness of the stream is just incredible, and I don't see myself going back any time soon.
We had the exact same issue with RabbitMQ (HA setup) on GCP (running on GKE) a few weeks ago. We tried contacting support about it, but support is paid - there's no free customer support even for their own bugs.<p>The workaround we've come up with so far is to disable automatic migrations, though we're not sure that option actually does anything.
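For reference, the knob we believe corresponds to "disable automatic migrations" is the instance's on-host maintenance policy - a rough example below (instance name and zone are placeholders, and since GKE manages its nodes from a template, the setting may not survive node recreation):

    # Sketch only: instance name and zone are placeholders.
    gcloud compute instances set-scheduling rabbitmq-node-1 \
        --zone=us-central1-a \
        --maintenance-policy=TERMINATE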