I'm glad to see more cluster management software getting open sourced, and this is sort of on the right track.

However, looking at the design, this still has a long way to go. There are a lot of failure modes you guys haven't encountered yet, which will result in a few design tweaks. For example, what happens if your health checkers start reporting garbage data (e.g. because they're too overloaded to perform health checks properly)? Or what happens when a query of death is issued? Also, things like traffic sloshing can very quickly build resonant failures in a system like this.

(Source: many years working on Google infrastructure, including causing outages related to load balancing code)
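For illustration (thresholds and class names made up, nothing to do with the actual SmartStack code), the kind of guard I mean against a misbehaving checker is hysteresis plus a cap on how much capacity any health signal is allowed to remove:

    # Sketch only: require several consecutive failures before evicting a
    # backend, and never evict more than a fraction of the pool at once.
    # A flood of simultaneous "failures" usually means the checker (or a
    # query of death) is the real problem, not the backends.
    class GuardedPool:
        def __init__(self, backends, fail_threshold=3, max_down_fraction=0.3):
            self.backends = set(backends)
            self.fail_counts = dict((b, 0) for b in backends)
            self.down = set()
            self.fail_threshold = fail_threshold
            self.max_down_fraction = max_down_fraction

        def report(self, backend, healthy):
            if healthy:
                self.fail_counts[backend] = 0
                self.down.discard(backend)
                return
            self.fail_counts[backend] += 1
            if self.fail_counts[backend] < self.fail_threshold:
                return  # hysteresis: one bad check isn't enough
            if float(len(self.down) + 1) / len(self.backends) > self.max_down_fraction:
                return  # refuse to drain the pool; keep serving degraded
            self.down.add(backend)

        def healthy_backends(self):
            return sorted(self.backends - self.down)

Even that doesn't save you from resonance: the remaining backends pick up the shed load, start failing their own checks, and the oscillation feeds itself unless you damp it somewhere.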
Hi guys! I'm one of the primary authors of SmartStack. Happy to answer any questions that aren't covered in the blog post.

We're also doing a Tech Talk on SmartStack today at Airbnb HQ; stop by if you're in SF: https://www.airbnb.com/meetups/33925h2sx-tech-talk-smartstack-scaling-airbnb
Great post. This is interesting due to similar discussions we're having at work about moving from a monolithic Rails app architecture to an SOA.

I'm curious though, what does the local development environment look like when you run an SOA of this complexity? Does everyone need to run a series of Vagrant VMs/Docker containers to have a fully functional local version of the application running?
> On AWS, you might be tempted to use ELB, but ELBs are terrible at internal load balancing because they only have public IPs.

Yet another reason to run in a VPC, which includes internal-only ELBs as a very useful feature.
ELB will actually do internal load balancing in a VPC using your own custom security groups. Doesn't help if you're not in a VPC, but nowadays the default is for everything to go in a VPC.
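For anyone who hasn't tried it: an internal ELB is just a scheme flag at creation time. For example, with boto3 (the subnet, security group, and listener values here are placeholders):

    import boto3

    # Sketch: a classic ELB that only gets a private address inside the VPC.
    elb = boto3.client("elb", region_name="us-east-1")
    elb.create_load_balancer(
        LoadBalancerName="internal-service-lb",
        Listeners=[{
            "Protocol": "HTTP",
            "LoadBalancerPort": 80,
            "InstanceProtocol": "HTTP",
            "InstancePort": 8080,
        }],
        Subnets=["subnet-aaaa1111"],      # private subnets in the VPC
        SecurityGroups=["sg-bbbb2222"],
        Scheme="internal",                # the key bit: no public IP
    )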
I don't know... sounds like they just shifted the single point of failure to the Zookeeper cluster. Is it somehow sexier to have your SPOF be things running Zookeeper instead of things running load-balancer or DNS software?

"The achilles heel of SmartStack is Zookeeper. Currently, the failure of our Zookeeper cluster will take out our entire infrastructure. Also, because of edge cases in the zk library we use, we may not even be handling the failure of a single ZK node properly at the moment."
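Handling those zk edge cases correctly is most of the battle; at minimum the consumer side has to treat a lost session as "keep the last-known-good backend list", not "zero backends". A rough sketch of what I mean, using the Python kazoo client (the path and hosts are made up):

    from kazoo.client import KazooClient, KazooState

    SERVICE_PATH = "/services/search/instances"  # hypothetical registration path
    last_known_good = []

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")

    def on_state_change(state):
        # SUSPENDED/LOST: don't clear the cache; stale backends beat none at all.
        if state in (KazooState.SUSPENDED, KazooState.LOST):
            print("ZK connection %s; keeping %d cached backends"
                  % (state, len(last_known_good)))

    zk.add_listener(on_state_change)
    zk.start()

    @zk.ChildrenWatch(SERVICE_PATH)
    def on_instances_change(children):
        global last_known_good
        if children:  # ignore empty reads that show up around reconnects
            last_known_good = sorted(children)

That still doesn't help if the whole ensemble is down while backends churn underneath you, which is exactly the SPOF the post is admitting to.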
Cool stuff. I think, though, that much of this can be handled in other ways (although obviously there is never one right way to do these kinds of things). This application kit is one way of orchestrating service/server discovery. Another way, which I have implemented personally, is to use a combination of mcollective and puppet (with puppet facts enabled). This lets you define roles for specific systems, run tasks against servers of a specific role type, keep track of which servers have that role, connect them to a 'central' load balancer, and so on.

This solves most of the issues that this toolkit addresses, but it likely isn't the best option for everyone. Just some info on at least one other way to deal with this stuff!
I've been hearing good stuff about this for a while - it's awesome to see it open sourced! Setting up a local haproxy to handle service failover/discovery is a really clever solution, and a great way to encapsulate a bunch of messy logic.
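The part I like is how little the application has to know. If (hypothetically) Synapse exposes the "search" service on a fixed local port, calling it is just:

    import requests

    # Hypothetical local port assigned to the "search" service in Synapse's
    # config; the app never learns individual backend addresses. Failover,
    # health checking, and discovery all happen inside the local haproxy.
    SEARCH_SERVICE = "http://localhost:3212"

    def search(query):
        resp = requests.get(SEARCH_SERVICE + "/search",
                            params={"q": query}, timeout=1.0)
        resp.raise_for_status()
        return resp.json()

Adding backends, draining a box, or failing over a whole service never touches application code; the local haproxy config just gets rewritten underneath it.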
So... SOA.

Everyone is doing it, but I never see details about how the services are wired together.

Do people use ESBs, or directly wire up services to each other?