From a talk by Caitie McCaffrey of Twitter at the Strange Loop conference.<p>Video: <a href="https://www.youtube.com/watch?v=H0i_bXKwujQ" rel="nofollow">https://www.youtube.com/watch?v=H0i_bXKwujQ</a><p>Slides: <a href="https://speakerdeck.com/caitiem20/building-scalable-stateful-services" rel="nofollow">https://speakerdeck.com/caitiem20/building-scalable-stateful...</a>
Stateful: you have one web server!<p>Stateless: you grow to require tens of servers or more; horizontal scalability is much cheaper than vertical, so you look to software solutions to keep expenses down and move to clustered NoSQL DBs like Riak, Cassandra, Hadoop, etc. 1-2 engineers can still run the whole show; cloud services, SaaS, and PaaS are employed.<p>Stateful: you run thousands of servers, having since brought many services back in-house. Many if not most run on your own metal, with dedicated staff. Looking to rein in power bills and space requirements, you look once again at software solutions.<p>If you stay at the same growing company long enough, what's old will be new again.
Rule of thumb: if you design a service that spans many servers, you have the following options:<p>1. Keep the service stateless. You can update it frequently with no downtime... Relatively easy.<p>2. Use an off-the-shelf service that holds the state for you and that you don't need to update frequently (e.g. ElastiCache, Cassandra, ...). Relatively easy.<p>3. Write your own stateful service. For some applications it is a must (e.g. your own search service, data processing, a game collision engine). You need to take care of transferring state during restarts/upgrades, and client routing is also tricky (see the sketch below). Hard, but sometimes there is no way around it if you want efficient infrastructure.<p>4. Don't think about state at all, and you may end up crying after your code hits prod.
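<p>For option 3, the client-routing part can be as simple as a consistent-hash ring that maps an entity id to the node holding its state. Here is a minimal TypeScript sketch, assuming a static node list and made-up names like <i>ownerOf</i>; a real system would feed the ring from a cluster-membership service and hand off state whenever membership changes:
<pre><code>
import { createHash } from "crypto";

// Hypothetical node list; in a real deployment this would come from a
// cluster-membership service (gossip, ZooKeeper, etcd, ...).
const nodes = ["node-a:7000", "node-b:7000", "node-c:7000"];

// Each node gets many virtual points so keys spread evenly and a
// membership change only remaps a fraction of the keys.
const VIRTUAL_POINTS = 128;

type RingEntry = { point: number; node: string };

// First 8 hex chars of SHA-1 -> a 32-bit position on the ring.
function hashToPoint(value: string): number {
  return parseInt(
    createHash("sha1").update(value).digest("hex").slice(0, 8),
    16
  );
}

const ring: RingEntry[] = nodes
  .flatMap((node) =>
    Array.from({ length: VIRTUAL_POINTS }, (_, i) => ({
      point: hashToPoint(`${node}#${i}`),
      node,
    }))
  )
  .sort((a, b) => a.point - b.point);

// Route a user/session/entity id to the node that owns its state:
// walk clockwise to the first ring point at or after the key's hash.
function ownerOf(key: string): string {
  const p = hashToPoint(key);
  return (ring.find((e) => e.point >= p) ?? ring[0]).node;
}

console.log(ownerOf("user:42")); // always the same node for this key
</code></pre>
The upgrade/restart headache is exactly that last property: once clients pin a key to a node, draining that node means moving its state somewhere else before taking it down.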
A comparison with Erlang/OTP:<p><a href="http://christophermeiklejohn.com/papers/2015/05/03/orleans.html" rel="nofollow">http://christophermeiklejohn.com/papers/2015/05/03/orleans.h...</a>
I think that, in general, anything with no persistence can be shared-nothing. State in a shared-nothing tier is basically a cache, kept fresh by subscribing to changes in the data store, with only a slight lag.<p>Shared-nothing can include environments like user agents, proxies, and web servers.<p>As for the persistence layer / data store, it should support horizontal partitioning. Especially useful is range-based partitioning on a primary key whose prefix contains a geohash, because then you can route requests to the closest region on AWS or whichever host you use.<p>If one of your shards gets too large you can split it into two or more shards. All the monitoring and splitting can be automated, with cloud tooling provisioning machines and so on, so you don't need to wake up at 3am.<p>With this setup you can reliably grow your data store to an arbitrary size, with only O(log n) growth in latency for any request. However, there is one more issue to solve:<p>When you need database queries that return a cross product or join, do you compute the result on the fly per request (e.g. with MapReduce), or do you precompute it whenever a row is inserted into one of the joined tables? The second approach can run in the background and trades memory for time, making the query itself O(1). That is really useful for queries that need answers in real time.<p>I would recommend evented servers (e.g. Node.js) for queries that hit multiple shards at once, or MapReduce-type work, because evented I/O lets you wait only as long as the slowest sub-query.<p>Finally, I don't think things like socket.io will be easy to partition horizontally, e.g. across a Node cluster, so you'll probably want server affinity on a per-room basis.
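<p>To make the geohash-prefixed range partitioning concrete, here is a rough TypeScript sketch. The shard map, key format, and region assignments are invented for illustration; in practice the map would live in a config/metadata service and get updated whenever a shard splits:
<pre><code>
// Each shard owns a contiguous, lexicographically sorted key range.
// Keys are prefixed with a geohash so nearby users land on nearby
// shards, which can then be pinned to the nearest region.
type Shard = { startKey: string; region: string };

// Illustrative only: real boundaries come from observed key distribution.
const shardMap: Shard[] = [
  { startKey: "",  region: "us-west-2" }, // everything below "d", e.g. "9..." prefixes (western US)
  { startKey: "d", region: "us-east-1" }, // "d..." roughly covers the eastern US
  { startKey: "g", region: "eu-west-1" }, // "g"/"u" prefixes roughly cover Europe
].sort((a, b) => (a.startKey < b.startKey ? -1 : 1));

function makeKey(geohash: string, userId: string): string {
  return `${geohash}:${userId}`;
}

// Pick the last shard whose startKey <= key (linear scan here;
// a real router would binary-search a much larger map).
function shardFor(key: string): Shard {
  let owner = shardMap[0];
  for (const s of shardMap) {
    if (s.startKey <= key) owner = s;
  }
  return owner;
}

const key = makeKey("9q8yy", "user-123"); // "9q8yy" is a San Francisco geohash
console.log(shardFor(key).region);        // -> "us-west-2"
</code></pre>
Splitting a hot shard then just means inserting a new startKey boundary into the map and moving the upper half of the range.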
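<p>And a small sketch of the evented scatter-gather pattern, using Promise.all so the total wait is the slowest shard rather than the sum. <i>queryShard</i> is a hypothetical stand-in for whatever driver or HTTP endpoint your shards expose, and the example assumes Node 18+ for the built-in fetch:
<pre><code>
// Scatter a query to every shard in parallel and gather the results.
// With evented I/O the total wait is the slowest shard, not the sum.
async function queryShard(shard: string, query: string): Promise&lt;unknown[]&gt; {
  // Stand-in transport; swap in your actual driver (Cassandra, HTTP, ...).
  const res = await fetch(`http://${shard}/query`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ query }),
  });
  return res.json();
}

async function scatterGather(shards: string[], query: string): Promise&lt;unknown[]&gt; {
  // Promise.all resolves when the last (slowest) shard answers,
  // so latency is max(shard latencies) rather than their sum.
  const perShard = await Promise.all(shards.map((s) =&gt; queryShard(s, query)));
  return perShard.flat();
}

// Usage: merge partial results from three hypothetical shards.
scatterGather(["shard-0:8080", "shard-1:8080", "shard-2:8080"], "SELECT ...")
  .then((rows) =&gt; console.log(`got ${rows.length} rows`));
</code></pre>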