Here's one we made recently: purchasing an array of hard drives (for storage servers) without making sure they weren't all from the same batch. Since they were made in the same batch, they had the same defects, and when they failed, they failed one after another within a very short interval. Since all of them failed, RAID didn't help; we had to restore from the day-old offline backup.
Fork is actually a very fast system call. It never blocks, and (on Linux) only involves copying a very small amount of bookkeeping information. If you exec right after the fork, there is basically no overhead.<p>However, forking a new shell to parse "mv foo bar" is more expensive than just using the rename system call. Calling rename directly also makes it easier to check for errors, and so on.<p>SQLite is also not as slow as people think it is; you can easily handle tens of millions of requests per day with it. If your application's semantics require table locks, MySQL and Postgres are not going to magically eliminate contention for those locks. It's just that they both pick very weak locking levels by default. (They run fast, but make it easy to corrupt your data. Incidentally, I think they do this not for speed, but so that transactions never abort. Apparently that scares people, even though aborting is the whole point of transactions. </rant>.)<p>Most of my production apps are SQLite or BerkeleyDB, and they perform great. I am not Google, however.
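To make the shell-vs-syscall point concrete, here is a minimal Python sketch (filenames are hypothetical) contrasting the two routes for moving a file:

```python
import os

src, dst = "foo.txt", "bar.txt"   # hypothetical filenames
open(src, "w").close()            # create something to move

# os.system("mv foo.txt bar.txt") would fork, exec /bin/sh, have the
# shell parse the command line, then fork/exec mv -- several process
# steps for one metadata operation, with only an exit code to inspect.

# os.rename is the single rename() system call underneath mv, and it
# raises OSError immediately if, say, the source file is missing.
os.rename(src, dst)
assert os.path.exists(dst)
os.remove(dst)                    # clean up
```

The error-handling difference is the practical win: a failed `os.rename` raises an exception at the call site, while a failed `mv` in a subshell hands back a status code that is easy to ignore.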
The memory use is not accurate unless you take shared pages into account. Copy-on-write will make it look like each Apache child is using 40MB, when really it's only 10MB of private RSS. Use an RSS-calculating script (<a href="http://psydev.syw4e.info/new/misc/meminfo.pl" rel="nofollow">http://psydev.syw4e.info/new/misc/meminfo.pl</a>) to determine the close-to-real memory use. If you don't calculate your maximum memory use correctly, you will run into swap during traffic peaks. Also keep in mind that swap is a <i>good</i> thing. Is your app constantly cycling children? That prevents unused/shared memory from ever being moved into swap. Don't paper over memory leaks by reducing your max requests per child.<p>The forking thing is more of the same. Copy-on-write means it's not going to balloon your memory unless some function turns that shared RSS into private RSS. It isn't something you want to do a lot of, though.
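The private-vs-shared distinction comes straight out of /proc/&lt;pid&gt;/smaps on Linux. A minimal Python sketch of summing the private pages (the sample below is fabricated smaps output, not from a real process):

```python
# Sum Private_Clean + Private_Dirty across all mappings in smaps output.
# This is memory a child actually "owns"; shared copy-on-write pages
# inherited from the parent are excluded, which is what a naive reading
# of top's RSS column gets wrong.
def private_rss_kb(smaps_text):
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(("Private_Clean:", "Private_Dirty:")):
            total += int(line.split()[1])  # field is in kB
    return total

# Fabricated excerpt for illustration:
sample = """\
Rss:            40960 kB
Shared_Clean:   30720 kB
Private_Clean:    512 kB
Private_Dirty:   9728 kB
"""
print(private_rss_kb(sample))  # 10240 kB private vs 40960 kB apparent
```

On a real system you would feed it `open("/proc/%d/smaps" % pid).read()` for each child and sum the results, which is essentially what scripts like the meminfo.pl linked above do.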
One of the most common problems we see is DNS misconfiguration. It seems most folks just haven't read the grasshopper book. If you're doing anything on the Internet, you <i>need</i> a basic understanding of DNS.<p>Once you grasp the fundamentals, most DNS problems become completely transparent, but I've seen people spend <i>weeks</i> trying to solve DNS problems due to lack of understanding.
In my experience, one of the most common mistakes is the failure to realise that on pretty much all Linux distros, services like Apache and MySQL come conservatively tuned. This is deliberate; it means a DoS or out-of-control process within one of those domains is unlikely to take out the entire server, because there's a hard limit on consumption of memory, CPU, child processes, threads, etc.<p>However, this default configuration needs to be tuned to allow you to take advantage of the hardware - if you have generous hardware. Otherwise, you will wonder why your web sites are extremely unresponsive, yet the server load stands at something relatively unimpressive.<p>I found this out the first time a blog post on one of my servers got digg'd.
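Tuning those limits upward is mostly arithmetic. A hedged sketch of sizing something like Apache's MaxClients from per-child private RSS (every figure below is invented for illustration):

```python
# Rough capacity math: how many prefork children fit in RAM without
# pushing the box into heavy swapping. All numbers here are made up.
def max_clients(total_ram_mb, reserved_mb, per_child_private_mb):
    """Children that fit after reserving room for the OS, MySQL, etc."""
    return (total_ram_mb - reserved_mb) // per_child_private_mb

# 4GB box, 1GB reserved for the OS and MySQL, 10MB private RSS per
# child (the private figure, not the 40MB apparent copy-on-write RSS).
print(max_clients(4096, 1024, 10))  # → 307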
I'd guess the real number one mistake is insufficient paranoia about backups.<p>I know lots of companies doing TDD that have nevertheless never done a full test restore from their backups.
This was a great article, but I ended it wondering whether they (a) knew what a system call was (until the end, I thought maybe they meant a system() shell-out) or (b) realized how many system calls a vanilla request/response cycle incurs.
I disagree with 1.3. "Serving static content is the easiest possible task for any web server." Yes, but keeping connections open for slow clients (esp. with KeepAlive on) is not a good use of your 500MB Mongrel process's time. On the other hand, KeepAlive is a handy thing to have.<p>Using a proxy like nginx or varnish to serve static files (and even dynamic data) can save you a <i>lot</i> of server resources at the application layer, provided you have the proper KeepAlive and Nagle bits flipped.
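As a concrete sketch of that split (server names, paths, and the backend port are all assumptions, not a recommended production config), an nginx front end might look like:

```nginx
server {
    listen 80;
    keepalive_timeout 15;        # nginx holds slow keep-alive clients cheaply

    # Static files: served straight off disk, never touching the app server.
    location /static/ {
        root /var/www/myapp;     # hypothetical document root
        expires 7d;
    }

    # Everything else: proxied to the heavyweight app process.
    location / {
        proxy_pass http://127.0.0.1:3000;   # e.g. a Mongrel instance
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```

The point is that the 500MB app process only ever sees requests that actually need it; nginx absorbs the slow clients and the keep-alive idling.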
Yep, #1 happened to me the other day. We hit our Apache server limit of 256 and the site slowed to a crawl. I'm not really sure what was causing the load to be like 50-90, but requests were quite delayed waiting for an open process (keepalive was at 5 secs).<p>Indeed, my first idea was to install nginx for images really quickly. However, I have no experience with nginx. Thankfully, we had a spare server, and I offloaded the images there for now... Throwing more hardware at the problem usually works.
FTA:<p><i>However, sqlite should never be used in production. It is important to remember that sqlite is single flat file, which means any operation requires a global lock</i><p>I don't know jack about sqlite's locking architecture or scalability, but this statement is just silly. There are a conceptually infinite number of ways to make fine-grained locking work on a single file, whether within a single process, on a single host, or across a network. Maybe the author is thinking fcntl() locking is somehow the only option.<p>I guess the corollary to this article has to be "Don't let your startup's sysadmins diagnose development-side issues."
I'd say their biggest mistake is usually not hiring a sysadmin who also has development experience (or, equivalently, hiring developers without sysadmin experience). I've found that my knowledge in both realms has been invaluable in determining how to design the infrastructure and how to write the code.
<i>If you fork inside an app server, such as mod_python, you will fork the entire parent process (apache!). This could happen by calling something like os.system("mv foo bar") from a python application.</i><p>I nominate this post as the most distressingly important bit of information I've ever received at 2:43 in the morning.<p>Now the question: what can I do in Ruby to avoid the four calls a second or so I'm currently making to system(big_command_to_invoke_imagemagick)?
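The question is about Ruby, but the quoted passage is Python, and in either language the first step is the same: pass an argument vector instead of a command string, so no shell is forked just to parse the command line. The app process itself is still forked, but fork followed by an immediate exec is cheap under copy-on-write. A Python sketch (the ImageMagick binary name and flags are assumptions):

```python
import subprocess
import sys

# Passing a list instead of a string skips the intermediate
# "fork /bin/sh to parse the command" step: the child exec()s the
# target binary directly after the fork.
def run_tool(argv):
    # check_call raises CalledProcessError on a nonzero exit status,
    # instead of returning a code the caller might silently ignore.
    subprocess.check_call(argv)

# Hypothetical ImageMagick invocation (binary name and flags assumed):
#   run_tool(["convert", "in.png", "-resize", "200x200", "out.png"])
# Portable demonstration using the Python interpreter itself:
run_tool([sys.executable, "-c", "print('resized')"])
```

Beyond that, moving the four-a-second conversions out of the request cycle into a separate worker process or queue avoids forking the large app server at all.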
Personally I don't think we would ever run into those issues: A) we don't have other servers to switch over to, B) we are using MySQL for testing and development, and C) we don't like what happens when we make system calls from within a web app, never mind forking.
Take #1 and generalize it to the mistake of trying to fix a problem without really understanding what the problem is. This has to be the most common mistake I've seen in the sysadmin world.
Is this an example of the knowledge level of a modern sysadmin? If so, we're in trouble. =)<p>A sysadmin should be able to think in terms of data flows (which means memory management, data partitioning, and network stack usage), be able to put different types of data into different kinds of storage, and understand the role of caches and how data should be accessed.<p>Packages are just tools.