Great post, but this part scares me a bit...<p><i>I think a lot of services (even banks!) have serious security problems and seem to be able to weather a small PR storm. So figure it out if it really is important to you (are you worth hacking? do you actually care if you’re hacked? is it worth the engineering or product cost?) before you go and lock down everything.</i><p>Just because you can "afford" to be hacked doesn't mean you shouldn't take all the steps necessary to proactively protect your data. In the end, security is not about you, it is about your users. This is exactly the type of attitude that leads to all the massive breaches we have been seeing recently. Sure, your company is "hurt" by bad PR, but your users are the real victims. You should consider their risk (especially with something as sensitive as people's files!) before you consider your own company's well-being.<p>Edit: formatting
The idea of running extra load sounds good in theory, but I can't help thinking that it's a bit like setting your watch forwards to try and stop being late for things. Eventually you know your watch is 5 minutes fast, so you start compensating for it. I wonder if this strategy starts to have the same effect: putting fixes off because you know you can pull the extra load before it becomes critical, in the same way you leave for the train a couple of minutes later because you know your watch is running fast.
I wish he'd left the security advice out.<p>The whole post was excellent, but all the useful points will now be overshadowed by the armchair quarterbacking about security by people who mostly don't understand that <i>ALL</i> security is a compromise, and it is as important to <i>understand</i> and make deliberate decisions about your security as it is to try to make a secure system in the first place.
<i>but I really hate ORM’s and this was just a giant nuisance to deal with</i><p>I like object relational mapping as a theory (i.e. I have an object of type Author which has 1 or more books I can loop over), but I hate ActiveRecord implementations. Eventually, they just end up implementing almost all of SQL but in some arcane bullshit syntax or sequence of method calls that you have to spend a bunch of time learning.<p>I also seriously doubt that anyone has ever written a production system of any reasonable complexity and been able to use the exact same ORM code with absolutely any backend (if you have an example please correct me on this). This barely even works with something like PDO in PHP, which is a bare-bones abstraction across multiple SQL backends.<p>When it comes down to it, the benefits of ActiveRecord are all but dead on about the third day of development. The data mapper pattern adopted by SQLAlchemy (et al.) takes all of the shitness of ActiveRecord and adds mind-bending complexity to it.<p>SQL is easy to learn and very expressive. Why try and abstract it?<p>I spent years working with an ActiveRecord ORM I wrote myself in my feckless youth and thought that it was the answer to the world's problems.
I didn't really understand why it was so terrible until I did a large project in Django and had to use someone <i>else's</i> ORM.<p>When I really analysed it, there were only three things that I really wanted out of an ORM:<p>1) Make the task of writing complex join statements a bit less tedious<p>2) Make the task of writing a sub-set of very basic where clauses slightly less tedious<p>3) Obviate the need for me to detect primary key changes when iterating over a joined result set to detect changes in an object (for example, looping over a list of Authors and their Books)<p>To that end, I wrote this:<p><a href="https://github.com/iaindooley/PluSQL" rel="nofollow">https://github.com/iaindooley/PluSQL</a><p>It's written in PHP because I like and use PHP but it's a very simple pattern that I would like to see elaborated upon/taken to other languages as I think it provides just the bare minimum amount of functionality to give some real productivity gains without creating a steep learning curve, performance trade-off or any barrier to just writing out SQL statements if that's the fastest way to solve the problem at hand.
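To make point 3 concrete, here is a minimal sketch in Python of collapsing a joined Author/Book result set into one entry per author. This is not PluSQL's API; the row layout, the `rows` sample data and the `group_books_by_author` helper are all assumptions made up for illustration, and the rows are assumed to arrive ordered by the author's primary key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, shaped like the result of a query such as:
#   SELECT a.id, a.name, b.title
#   FROM author a JOIN book b ON b.author_id = a.id
#   ORDER BY a.id
rows = [
    (1, "Ann", "First Book"),
    (1, "Ann", "Second Book"),
    (2, "Bob", "Only Book"),
]


def group_books_by_author(rows):
    """Collapse joined rows into one dict per author.

    groupby starts a new group whenever the (id, name) key changes, which
    replaces the hand-written "has the primary key changed?" check.
    The input must already be sorted by author id.
    """
    authors = []
    for (author_id, name), group in groupby(rows, key=itemgetter(0, 1)):
        authors.append(
            {"id": author_id, "name": name, "books": [row[2] for row in group]}
        )
    return authors
```

Calling `group_books_by_author(rows)` gives one dict per author with a `books` list, so the calling code can loop over authors and then over each author's books without tracking key changes itself.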
Great advice:<p>"pick lightweight things that are known to work and see a lot of use outside your company, or else be prepared to become the “primary contributor” to the project."
Fabulous post. Thanks for writing.<p>One point it misses though is to test your backup strategy often. When you scale fast things break very often and it's good to be in practice of restoring from backups every now and then.
<i>I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed where it should have</i><p>Why not take the extra half a second to make those random strings meaningful and hidden behind a DEBUG log level?
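For what it's worth, a rough sketch of what that could look like with Python's standard `logging` module. The logger name `blockserver` and the `handle_block` function are hypothetical, not anything from the article.

```python
import logging

# Hypothetical logger name; any module-level logger works the same way.
logger = logging.getLogger("blockserver")


def handle_block(block_id):
    # A meaningful marker instead of a random string. It is only emitted
    # when the logger's effective level is DEBUG or lower.
    logger.debug("handle_block reached: block_id=%s", block_id)
    return block_id
```

In production you would leave the level at INFO or above so the marker stays silent, and switch on `logging.basicConfig(level=logging.DEBUG)` only while chasing a problem.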
'Even memcached, which is the conceptually simplest of these technologies and used by so many other companies, had some REALLY nasty memory corruption bugs we had to deal with, so I shudder to think about using stuff that’s newer and more complicated'<p>Does anyone know what memory corruption bugs they are referring to?
For the record, I use SQLAlchemy 0.6.6 regularly under fairly heavy load, and have never had a problem with it. Any "SQLAlchemy bugs" are inevitably coding mistakes on my part.
I believe that the section on "The security-convenience tradeoff" is fundamentally flawed.<p>A username and password represent a pair. Neither one has meaning in terms of authentication without the other.<p>Take the example where I have forgotten my username (JohnGB), but try what I think it is (say, JohnB) and enter the correct password for my actual username. If JohnB happens to be someone else's account, the system would tell me that my username is fine but that my password isn't. From then on, I would be trying to reset the password for a different user, because the system has already told me that my username was correct.<p>Please, for the sake of sane UX, don't do this!
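A minimal sketch of the alternative, in Python: treat the username and password as a pair and return the same message whichever half is wrong. The in-memory `USERS` store, the `login` function and the message text are assumptions for illustration only, not how any real service does it.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    # PBKDF2 from the standard library; the iteration count is illustrative.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical in-memory user store: username -> (salt, password hash).
_salt = os.urandom(16)
USERS = {"JohnGB": (_salt, hash_password("correct horse", _salt))}

GENERIC_ERROR = "Username and password do not match."


def login(username, password):
    """Return (ok, message) without revealing which half of the pair failed."""
    record = USERS.get(username)
    if record is None:
        return False, GENERIC_ERROR
    salt, expected = record
    if not hmac.compare_digest(hash_password(password, salt), expected):
        return False, GENERIC_ERROR
    return True, "Welcome back."
```

With this, trying `JohnB` with the right password for `JohnGB` produces the same message as a wrong password, so the user is never told that a mistyped username was "correct".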
A topic usually left out in scaling discussions is: how much can one predict? Or is it mostly trial and error? Is it mostly about good "reactive" engineering, or would it have benefited from good mathematical modeling?
> <i>I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed where it should have</i><p>:)<p>I've never seen a shorter description of real-world software development. That's it in a nutshell!
Great article! Small nitpick from someone who just tried this on his server logs :)<p><pre><code> * on my machine xargs -I implies -L1, so you can drop that
* use gnuplot -p or the graphic will disappear immediately after rendering</code></pre>
There's a talk about Dropbox scaling at <a href="http://www.stanford.edu/class/ee380/winter-schedule-20112012.html" rel="nofollow">http://www.stanford.edu/class/ee380/winter-schedule-20112012...</a> .
Great article. Rajiv made it easy to understand the conceptual framework. The lesson is: always strive to be robust. Test your failure points deliberately. Applicable to more than just server scaling.
I'm surprised that Dropbox actually uses S3 internally to store data. All along I had assumed, wrongly, that Dropbox had built their own distributed storage cluster.
<p><pre><code> MySQL has a huge network of support and we were
pretty sure if we had a problem, Google, Yahoo,
or Facebook would have to deal with it and patch
it before we did. :)
</code></pre>
I am fairly certain Google is running its own (patched) version that's fairly different from the off-the-shelf MySQL.
Running with extra load seems inefficient in terms of energy consumption. Would it be possible to achieve the same thing by inserting delays or something that can be turned off?
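As a rough sketch of the "delay that can be turned off" idea, in Python: the `ArtificialDelay` class below is a hypothetical helper, not something from the article. Note that a sleep only models the latency side; it does not consume CPU, disk or network the way real extra load would, so whether it gives the same early warning is an open question.

```python
import time


class ArtificialDelay:
    """Adds a fixed sleep before each wrapped call while enabled."""

    def __init__(self, seconds, enabled=True):
        self.seconds = seconds
        self.enabled = enabled

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            # The flag is checked on every call, so the delay can be
            # switched off at runtime without re-wrapping the function.
            if self.enabled:
                time.sleep(self.seconds)
            return func(*args, **kwargs)

        return wrapper
```

Usage would be something like `delay = ArtificialDelay(0.05)` and `handler = delay.wrap(handler)`, then setting `delay.enabled = False` when you need the headroom back.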