Here's one we made recently: purchasing an array of hard drives (for storage servers) without making sure they weren't all from the same batch. Since they were made in the same batch, they had the same defects, and when they failed, they failed one after another within a very short interval. Since all of them failed, RAID didn't help; we had to restore from the day-old offline backup.
Fork is actually a very fast system call. It never blocks, and (on Linux) only involves copying a very small amount of bookkeeping information. If you exec right after the fork, there is basically no overhead.<p>However, forking a new shell to parse "mv foo bar" is more expensive than just using the rename system call. Calling rename directly also makes it easier to check for errors, and so on.<p>SQLite is also not as slow as people think it is; you can easily handle tens of millions of requests per day with it. If your application's semantics require table locks, MySQL and Postgres are not going to magically eliminate contention for those locks. It's just that they both pick very weak locking levels by default. (They run fast, but make it easy to corrupt your data. Incidentally, I think they do this not for speed, but so that transactions never abort. Apparently that scares people, even though aborting is the whole point of transactions. </rant>.)<p>Most of my production apps are SQLite or BerkeleyDB, and they perform great. I am not Google, however.
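To make the shell-vs-syscall point concrete, here is a minimal Python sketch (filenames are hypothetical) contrasting the two routes for moving a file:

```python
import os

src, dst = "foo.txt", "bar.txt"   # hypothetical filenames
open(src, "w").close()            # create something to move

# os.system("mv foo.txt bar.txt") would fork, exec /bin/sh, have the
# shell parse the command line, then fork/exec mv -- several process
# steps for one metadata operation, with only an exit code to inspect.

# os.rename is the single rename() system call underneath mv, and it
# raises OSError immediately if, say, the source file is missing.
os.rename(src, dst)
assert os.path.exists(dst)
os.remove(dst)                    # clean up
```

The error-handling difference is the practical win: a failed `os.rename` raises an exception at the call site, while a failed `mv` in a subshell hands back a status code that is easy to ignore.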
The memory use is not accurate unless you take shared pages into account. Copy-on-write will make it look like each Apache child is using 40MB, when really it's only 10MB of private RSS. Use an RSS-calculating script (<a href="http://psydev.syw4e.info/new/misc/meminfo.pl" rel="nofollow">http://psydev.syw4e.info/new/misc/meminfo.pl</a>) to determine the close-to-real memory use. If you don't calculate your maximum memory use correctly, you will run into swap during traffic peaks. Also keep in mind that swap is a <i>good</i> thing. Is your app constantly cycling children? That prevents unused/shared memory from ever being moved into swap. Don't paper over memory leaks by reducing your max requests per child.<p>The forking thing is more of the same. Copy-on-write means it's not going to balloon your memory unless some function turns that shared RSS into private RSS. It isn't something you want to do a lot of, though.
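The private-vs-shared distinction comes straight out of /proc/&lt;pid&gt;/smaps on Linux. A minimal Python sketch of summing the private pages (the sample below is fabricated smaps output, not from a real process):

```python
# Sum Private_Clean + Private_Dirty across all mappings in smaps output.
# This is memory a child actually "owns"; shared copy-on-write pages
# inherited from the parent are excluded, which is what a naive reading
# of top's RSS column gets wrong.
def private_rss_kb(smaps_text):
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(("Private_Clean:", "Private_Dirty:")):
            total += int(line.split()[1])  # field is in kB
    return total

# Fabricated excerpt for illustration:
sample = """\
Rss:            40960 kB
Shared_Clean:   30720 kB
Private_Clean:    512 kB
Private_Dirty:   9728 kB
"""
print(private_rss_kb(sample))  # 10240 kB private vs 40960 kB apparent
```

On a real system you would feed it `open("/proc/%d/smaps" % pid).read()` for each child and sum the results, which is essentially what scripts like the meminfo.pl linked above do.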
One of the most common problems we see is DNS misconfiguration. It seems most folks just haven't read the grasshopper book. If you're doing anything on the Internet, you <i>need</i> a basic understanding of DNS.<p>Once you grasp the fundamentals, most DNS problems become completely transparent, but I've seen people spend <i>weeks</i> trying to solve DNS problems due to lack of understanding.
In my experience, one of the most common mistakes is the failure to realise that on pretty much all Linux distros, services like Apache and MySQL come conservatively tuned. This is deliberate; it means a DoS or out-of-control process within one of those domains is unlikely to take out the entire server, because there's a hard limit on consumption of memory, CPU, child processes, threads, etc.<p>However, this default configuration needs to be tuned to allow you to take advantage of the hardware - if you have generous hardware. Otherwise, you will wonder why your web sites are extremely unresponsive, yet the server load stands at something relatively unimpressive.<p>I found this out the first time a blog post on one of my servers got digg'd.
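Tuning those limits upward is mostly arithmetic. A hedged sketch of sizing something like Apache's MaxClients from per-child private RSS (every figure below is invented for illustration):

```python
# Rough capacity math: how many prefork children fit in RAM without
# pushing the box into heavy swapping. All numbers here are made up.
def max_clients(total_ram_mb, reserved_mb, per_child_private_mb):
    """Children that fit after reserving room for the OS, MySQL, etc."""
    return (total_ram_mb - reserved_mb) // per_child_private_mb

# 4GB box, 1GB reserved for the OS and MySQL, 10MB private RSS per
# child (the private figure, not the 40MB apparent copy-on-write RSS).
print(max_clients(4096, 1024, 10))  # → 307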
I'd guess the real number one mistake is insufficient paranoia about backups.<p>I know lots of companies doing TDD that have nevertheless never done a full test restore from their backups.
This was a great article, but I ended it wondering whether they (a) knew what a system call was (until the end, I thought maybe they meant a system() shell-out) or (b) realized how many system calls a vanilla request/response cycle incurs.
I disagree with 1.3. "Serving static content is the easiest possible task for any web server." Yes, but keeping connections open for slow clients (esp. with KeepAlive on) is not a good use of your 500MB Mongrel process's time. On the other hand, KeepAlive is a handy thing to have.<p>Using a proxy like nginx or varnish to serve static files (and even dynamic data) can save you a <i>lot</i> of server resources at the application layer, provided you have the proper KeepAlive and Nagle bits flipped.
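As a concrete sketch of that split (server names, paths, and the backend port are all assumptions, not a recommended production config), an nginx front end might look like:

```nginx
server {
    listen 80;
    keepalive_timeout 15;        # nginx holds slow keep-alive clients cheaply

    # Static files: served straight off disk, never touching the app server.
    location /static/ {
        root /var/www/myapp;     # hypothetical document root
        expires 7d;
    }

    # Everything else: proxied to the heavyweight app process.
    location / {
        proxy_pass http://127.0.0.1:3000;   # e.g. a Mongrel instance
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```

The point is that the 500MB app process only ever sees requests that actually need it; nginx absorbs the slow clients and the keep-alive idling.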
Yep, #1 happened to me the other day. We hit our Apache server limit of 256 and the site slowed to a crawl. I'm not really sure what was causing the load to be like 50-90, but requests were quite delayed waiting for an open process (keepalive was at 5 secs).<p>Indeed, my first idea was to install nginx for images really quickly. However, I have no experience with nginx. Thankfully, we had a spare server, and I offloaded the images there for now... Throwing more hardware at the problem usually works.
FTA:<p><i>However, sqlite should never be used in production. It is important to remember that sqlite is single flat file, which means any operation requires a global lock</i><p>I don't know jack about sqlite's locking architecture or scalability, but this statement is just silly. There are a conceptually infinite number of ways to make fine-grained locking work on a single file, whether within a single process, on a single host, or across a network. Maybe the author is thinking fcntl() locking is somehow the only option.<p>I guess the corollary to this article has to be "Don't let your startup's sysadmins diagnose development-side issues."
I'd say their biggest mistake is usually not hiring a sysadmin who also has development experience (or, equivalently, hiring developers without sysadmin experience). I've found that my knowledge in both realms has been invaluable in determining how to design the infrastructure and how to write the code.
<i>If you fork inside an app server, such as mod_python, you will fork the entire parent process (apache!). This could happen by calling something like os.system("mv foo bar") from a python application.</i><p>I nominate this post as the most distressingly important bit of information I've ever received at 2:43 in the morning.<p>Now the question: what can I do in Ruby to avoid the four calls a second or so I'm currently making to system(big_command_to_invoke_imagemagick)?
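The question is about Ruby, but the quoted passage is Python, and in either language the first step is the same: pass an argument vector instead of a command string, so no shell is forked just to parse the command line. The app process itself is still forked, but fork followed by an immediate exec is cheap under copy-on-write. A Python sketch (the ImageMagick binary name and flags are assumptions):

```python
import subprocess
import sys

# Passing a list instead of a string skips the intermediate
# "fork /bin/sh to parse the command" step: the child exec()s the
# target binary directly after the fork.
def run_tool(argv):
    # check_call raises CalledProcessError on a nonzero exit status,
    # instead of returning a code the caller might silently ignore.
    subprocess.check_call(argv)

# Hypothetical ImageMagick invocation (binary name and flags assumed):
#   run_tool(["convert", "in.png", "-resize", "200x200", "out.png"])
# Portable demonstration using the Python interpreter itself:
run_tool([sys.executable, "-c", "print('resized')"])
```

Beyond that, moving the four-a-second conversions out of the request cycle into a separate worker process or queue avoids forking the large app server at all.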
Personally I don't think we would ever run into those issues: A) we don't have other servers to switch over to, B) we are using MySQL for testing and development, and C) we don't like what happens when we make system calls from within a web app, never mind forking.
Take #1 and generalize it to the mistake of trying to fix a problem without really understanding what the problem is. This has to be the most common mistake I've seen in the sysadmin world.
Is this an example of the knowledge level of a modern sysadmin? If so, we're in trouble. =)<p>A sysadmin should be able to think in terms of data flows (which means memory management, data partitioning, and network stack usage), be able to put different types of data into different kinds of storage, and understand the role of caches and how data should be accessed.<p>Packages are just tools.