It's worth noting that the instance migration basically null-routed the redis VM for a good 30 minutes, until we manually intervened and restarted it. The instance was completely disconnected from the internal network immediately following the migration. From what we could gather from the instance logs, the routing table on the VM was completely dropped and it couldn't even reach the magic metadata service (metadata.internal - we saw "no route to host" errors for it). This is a pretty serious bug within GCP and we've already opened a case with them in the hope they can get a fix out. I think this is the 4th or 5th major bug we've encountered with their live migration system that has led, or could have led, to an outage or internal service degradation. The GCP team has seriously investigated and fixed every bug we've reported to them so far, so props to them for that! Live migration is incredibly difficult to get right.<p>We believe this triggered a bug in the redis-py python driver we use (specifically this one: <a href="https://github.com/andymccurdy/redis-py/pull/886" rel="nofollow">https://github.com/andymccurdy/redis-py/pull/886</a>), which is what forced us to rolling-restart our API cluster in the first place to get the connection pools back into a working state. redis-sentinel correctly detected the instance going away and initiated a failover almost immediately after the instance went offline, but due to the odd network situation caused by the migration (total packet loss instead of connections being reset), the client driver was unable to properly fail over to the new master. We already have work planned on our own connection pooling logic for redis-py - right now the state of the driver in HA redis setups is actually pretty awful, and the maintainer doesn't appear to have the time to look at or close PRs that address these issues (we opened one back in March that fixes a pretty serious bug during failover, <a href="https://github.com/andymccurdy/redis-py/pull/847" rel="nofollow">https://github.com/andymccurdy/redis-py/pull/847</a>, and it has yet to be addressed).
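For anyone hitting the same failure mode, here's a minimal sketch (not our actual fix; the hostnames and the "mymaster" service name are placeholders) of how redis-py's Sentinel support can be configured so that this kind of silent packet loss surfaces quickly: with explicit socket timeouts a black-holed master turns into a TimeoutError instead of a connection that hangs forever, and the sentinel-backed pool can re-resolve the master address when it reconnects.

    # Sketch only: hosts, ports, and "mymaster" are placeholders.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(
        [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
        socket_timeout=0.5,            # fail fast instead of hanging on silent packet loss
        socket_connect_timeout=0.5,
    )

    # master_for() returns a client whose pool asks sentinel for the current
    # master address when it needs to (re)connect, which is what lets a
    # failover actually take effect once the dead connection is torn down.
    master = sentinel.master_for(
        "mymaster",
        socket_timeout=0.5,
        socket_connect_timeout=0.5,
        retry_on_timeout=True,         # retry once on timeout, reconnecting first
    )

    master.set("key", "value")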
The level of detail and linearity is impressive.<p>At this scale, it seems like it may be warranted to start using reliability testing in production, in line with Netflix.<p>At the end I see mention of a library with flaws. I'm curious which library that is, given that I develop some projects in Elixir.
Ever since they launched screen sharing, I've uninstalled both Skype and Hangouts and relied entirely on it for pair programming sessions. The smoothness of the stream is just incredible, and I don't see myself going back any time soon.
We had the exact same issue with RabbitMQ (HA setup) on GCP (running on GKE) a few weeks ago. We tried contacting support about it, but support is paid - there's no free customer support even for their own bugs.<p>The workaround we've come up with so far is to disable automatic migrations, though we're not sure that option actually does anything.
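For reference, the knob we believe corresponds to "disable automatic migrations" is the instance's on-host maintenance policy - a rough example below (instance name and zone are placeholders, and since GKE manages its nodes from a template, the setting may not survive node recreation):

    # Sketch only: instance name and zone are placeholders.
    gcloud compute instances set-scheduling rabbitmq-node-1 \
        --zone=us-central1-a \
        --maintenance-policy=TERMINATE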