The problem with benchmarks is that it's really, really difficult to emulate real-world conditions. However, here are some of the more obvious points that are unrealistic.

1) Comparing at the same number of cores. The core count selected for each testing level is completely arbitrary. With both web and database servers, which scale well with increasing core count, single-threaded performance is generally less of a concern and should not be a point of measure aside from average page load time. Some server configurations are optimized for a higher number of slower cores, while others are optimized for fewer but faster cores. By comparing like core counts, this testing is highly skewed toward the latter.

Comparing packages at the same price point, or by where the package sits in the product lineup (smallest, median, largest instance), would be much fairer. If 4 cores at one provider cost the same as 1 core at another, it's fair to compare the two at different core counts.

2) Server configurations. For both web and database servers, the best performance optimization you can make is to cache to RAM as much as possible. With increased caching, the need for disk I/O goes down significantly, easily by as much as an order of magnitude. Serving static content uses minimal resources and is mostly dependent on network performance. Dynamic content is more CPU intensive, and most of the time you can and should be caching the compiled opcode/bytecode. Most website database usage is read heavy, and many of the queries can be cached as well (rough sketch of the idea below). The one drawback to a heavy emphasis on caching is that if the server restarts, there may not be enough resources to service all requests while the cache warms up. However, given that dynamic loads are precisely what cloud offerings are supposed to excel at, you can spin up additional instances at those times, or just take a horizontally scaled approach to begin with so that a single instance failing won't have a major impact on your aggregate capacity.

3) Synthetic benchmarks, by their very nature, do a poor job of emulating real-world performance. The best way to benchmark both the web server and the database is to take a real site, log all the requests, and replay the logs (see the replay sketch below). What you want to measure is the maximum number of requests or queries that can be served, plus the average time and standard deviation at different request/query rates.

4) Network speed tests. The biggest mistake most tests make is measuring performance from content network to content network, rather than from content network to eyeball network. Especially with the current peering disputes between carriers and eyeball networks, this is more important than ever. It's a very difficult problem to solve, though, as it's not easy to run throughput tests from a large number of different eyeball networks. You would have to take a very large number of client-generated results and compare all the different providers across all their different locations, which would be nearly impossible. The next best thing, still a lot of work but more feasible, is to collect IPs on eyeball networks in as many locations as possible (perhaps just the top X cities by population) and run continuous pings/traceroutes over an extended period of time (sketch below). Average latency, standard deviation, and packet loss % are then your metrics.
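
To illustrate the query-caching idea from point 2, here's a minimal read-through cache sketch in Python. Everything in it is illustrative: run_query() is a hypothetical stand-in for whatever DB driver you use, and the 60-second TTL is arbitrary. The point is just that repeated reads served from RAM never touch the database or disk.

    # Minimal read-through cache sketch (illustrative only): query results
    # are kept in RAM with a TTL so repeated reads skip the database/disk.
    import time

    CACHE = {}          # sql -> (expires_at, rows)
    TTL_SECONDS = 60    # arbitrary TTL for the example

    def run_query(sql):
        # placeholder for a real database call via your DB driver
        raise NotImplementedError

    def cached_query(sql):
        now = time.time()
        hit = CACHE.get(sql)
        if hit and hit[0] > now:        # fresh cache entry: no DB I/O
            return hit[1]
        rows = run_query(sql)           # cache miss: hit the database once
        CACHE[sql] = (now + TTL_SECONDS, rows)
        return rows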
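
For point 3, a rough sketch of the log-replay approach, assuming you've dumped one logged request path per line into paths.txt and stood up a test host (BASE_URL and RATE are placeholders): it replays the paths at a fixed rate and reports mean latency, standard deviation, and failure count at that rate. Run it at several rates to see where latency and failures start to climb.

    # Rough log-replay sketch: replay logged request paths against a test
    # host at a fixed rate and record per-request latency.
    import statistics, time, urllib.request
    from concurrent.futures import ThreadPoolExecutor

    BASE_URL = "http://test-server.example"   # hypothetical target
    RATE = 50                                  # requests per second to replay at

    def fetch(path):
        start = time.time()
        try:
            urllib.request.urlopen(BASE_URL + path, timeout=10).read()
            return time.time() - start
        except Exception:
            return None                        # count as a failed request

    def replay(paths):
        latencies, failures = [], 0
        with ThreadPoolExecutor(max_workers=RATE) as pool:
            futures = []
            for path in paths:
                futures.append(pool.submit(fetch, path))
                time.sleep(1.0 / RATE)         # crude fixed-rate pacing
            for f in futures:
                rtt = f.result()
                if rtt is None:
                    failures += 1
                else:
                    latencies.append(rtt)
        print("requests:", len(paths), "failures:", failures)
        if len(latencies) >= 2:
            print("mean latency: %.3fs" % statistics.mean(latencies))
            print("stdev: %.3fs" % statistics.stdev(latencies))

    if __name__ == "__main__":
        with open("paths.txt") as f:           # one logged request path per line
            replay([line.strip() for line in f if line.strip()])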
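
And for point 4, a sketch of the continuous-ping idea. It assumes you've already assembled a list of eyeball-network IPs (eyeball_ips.txt) and it shells out to the system ping binary with Linux-style flags, so the flags may need adjusting per OS; in practice you'd run it for days or weeks rather than the ten rounds shown here.

    # Ping a list of eyeball-network IPs repeatedly and accumulate average
    # latency, standard deviation, and packet-loss % per IP.
    import re, statistics, subprocess, time
    from collections import defaultdict

    results = defaultdict(lambda: {"rtts": [], "sent": 0, "lost": 0})

    def ping_once(ip):
        out = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                             capture_output=True, text=True)
        match = re.search(r"time=([\d.]+)", out.stdout)
        return float(match.group(1)) if match else None   # None == lost packet

    def run(ips, rounds=10, interval=60):
        for _ in range(rounds):                # in practice: run for days/weeks
            for ip in ips:
                rtt = ping_once(ip)
                results[ip]["sent"] += 1
                if rtt is None:
                    results[ip]["lost"] += 1
                else:
                    results[ip]["rtts"].append(rtt)
            time.sleep(interval)
        for ip, r in results.items():
            loss = 100.0 * r["lost"] / r["sent"]
            if r["rtts"]:
                print(ip, "avg %.1f ms" % statistics.mean(r["rtts"]),
                      "stdev %.1f" % statistics.pstdev(r["rtts"]),
                      "loss %.1f%%" % loss)
            else:
                print(ip, "loss 100%")

    if __name__ == "__main__":
        with open("eyeball_ips.txt") as f:
            run([line.strip() for line in f if line.strip()])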