Issues like the ones raised are why all sufficiently advanced database engine designs tend to evolve toward a kernel bypass architecture. From the perspective of a database engine, operating systems do a lot of "dumb" things with resource management and scheduling that are essentially impossible to avoid, even though the database engine has enough context to handle them intelligently on its own. OS bypass in a database can have substantial performance benefits and add robustness for many edge cases. The obvious downside is that you basically end up rewriting the operating system in userspace, minus the device drivers, which is an enormous amount of work. It is not an incremental kind of design approach and it makes portability difficult. PostgreSQL actually does a really good job of being fast and robust without resorting to bypass internals.<p>I've always asserted that the reason we've never seen a proper kernel bypass database engine in open source is that the Minimum Viable Product is too complex. A bare, stripped-down, low-level database engine that does full bypass of the operating system is usually <i>at least</i> 100kLoC of low-level C/C++, and that is before you add all the features a database user will actually want. That is a big initial investment by people with fairly rare software design skills.
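The I/O half of that bypass usually starts with direct I/O. As a minimal sketch of the idea (not anyone's actual engine code; the 4K alignment and file name here are assumptions), opening with O_DIRECT on Linux skips the pagecache entirely, at the price of the application honouring the device's alignment rules itself:<p><pre><code> #define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 
 int main(void)
 {
     void *buf;
     int fd;
     /* O_DIRECT requires the buffer, file offset, and length to be
        aligned to the device's logical block size; 4K satisfies
        typical devices. */
     if (posix_memalign(&buf, 4096, 4096) != 0)
         return 1;
     memset(buf, 'X', 4096);
     fd = open("testfile.direct", O_WRONLY | O_CREAT | O_DIRECT, 0666);
     if (fd < 0)
     {
         perror("open");
         return 1;
     }
     /* The write goes straight to the device, skipping the pagecache;
        deciding what to cache becomes the engine's job, not the kernel's. */
     if (pwrite(fd, buf, 4096, 0) < 0)
         perror("pwrite");
     close(fd);
     free(buf);
     return 0;
 }
 </code></pre>
Everything the pagecache was doing for you - caching, read-ahead, write-back scheduling - now has to be reimplemented in userspace, which is exactly where the 100kLoC goes.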
I love that infinitely nerdy stuff like this can still make it to the HN homepage (there's still hope!).<p>This really needs to make it into the PG tuning guide (<a href="https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server" rel="nofollow">https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...</a>) - that's the only place where it will ultimately be seen by a worthwhile audience.
Being a user of both Linux and PostgreSQL, I'm very interested in this issue, but I only understand some of the words...<p>Could anybody wiser than me tell me whether I should be concerned, and what the possible implications of these decisions are? Should I invest in alternative platforms?
Point 2 is not actually the case, as long as your write does not partially fill a pagecache page (pagecache pages are the same size as the architecture's native page size - 4K on x86, x86-64 and ARM).<p>You can demonstrate this with a program like the following:<p><pre><code> #include <fcntl.h>
 #include <stdio.h>
 #include <unistd.h>
 
 int main(int argc, char *argv[])
 {
     size_t i;
     char pattern[512*1024];   /* a whole number of 4K pages */
     int fd;
     for (i = 0; i < sizeof pattern; i++)
         pattern[i] = 'X';
     fd = open("testfile", O_RDWR | O_CREAT | O_EXCL, 0666);
     if (fd < 0)
     {
         perror("open");
         return 1;
     }
     while (1)
     {
         /* Overwrite whole pages, flush them to disk, then drop
            them from the pagecache so the next pass starts cold. */
         pwrite(fd, pattern, sizeof pattern, 0);
         fdatasync(fd);
         posix_fadvise(fd, 0, sizeof pattern, POSIX_FADV_DONTNEED);
     }
     return 0;
 }
 </code></pre>
...then watch vmstat or iostat while it's running. Plenty of writes, no reads.<p>On the other hand, if you subtract one from the size of 'pattern', you'll see that you also get reads (as partially writing the last page requires a read-modify-write cycle).
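To see that case, the only change needed is to shorten the write by a byte (a sketch of just the modified line):<p><pre><code> /* One byte short of a whole page: the final 4K page is now
    partially written, so the kernel must read it in first. */
 pwrite(fd, pattern, sizeof pattern - 1, 0);
 </code></pre>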
I was under the impression, from reading another article, that setting the correct kernel I/O scheduler may help with a related concept: <a href="http://www.cybertec.at/postgresql-linux-kernel-io-tuning/" rel="nofollow">http://www.cybertec.at/postgresql-linux-kernel-io-tuning/</a>
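For what it's worth, a minimal sketch for checking which scheduler is active (assuming the device is sda; adjust the path for your block device - writing a scheduler name to the same file as root switches it):<p><pre><code> #include <stdio.h>
 
 int main(void)
 {
     char line[256];
     /* The kernel lists the available schedulers here and marks
        the active one in brackets, e.g. "noop [deadline] cfq". */
     FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
     if (!f)
     {
         perror("fopen");
         return 1;
     }
     if (fgets(line, sizeof line, f))
         fputs(line, stdout);
     fclose(f);
     return 0;
 }
 </code></pre>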
For 1: I don't remember exactly which machines use NUMA, but I thought it was limited to the first Opterons (and the behaviour makes sense).<p>2: Not sure; this may be specific to the FS, or something to do with the behaviour of mmapped files. However, I don't know how you guarantee that what you're writing corresponds to a single block in the FS (unless you're writing directly to /dev/sda, and even then).
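On the "single block" question, stat() can at least tell you the filesystem's preferred I/O granularity; a minimal sketch (the default path is an assumption, and note the read-modify-write behaviour above is driven by the 4K pagecache page size, not the on-disk block size):<p><pre><code> #include <stdio.h>
 #include <sys/stat.h>
 
 int main(int argc, char *argv[])
 {
     struct stat st;
     const char *path = argc > 1 ? argv[1] : "testfile";
     if (stat(path, &st) != 0)
     {
         perror("stat");
         return 1;
     }
     /* st_blksize is the filesystem's preferred I/O size; writes
        sized and aligned to a multiple of it avoid partial-block
        updates. */
     printf("%s: preferred I/O block size is %ld bytes\n",
            path, (long)st.st_blksize);
     return 0;
 }
 </code></pre>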
This is slightly OT, but I always wondered: what is the overhead of fetching cached stuff from the file system vs. having a built-in cache? Is there a way to circumvent going through the kernel and avoid a context switch?<p>In other words: could a database have only minimal built-in caching and instead rely on the OS cache?
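One way to get at this (a sketch, not a benchmark): mmap() maps the kernel's pagecache pages directly into the process, so once a page is resident, reads are plain memory accesses with no syscall or context switch; only the initial faults and evictions cross into the kernel. Assuming a non-empty file named testfile:<p><pre><code> #include <fcntl.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <unistd.h>
 
 int main(void)
 {
     struct stat st;
     long sum = 0;
     off_t i;
     char *p;
     int fd = open("testfile", O_RDONLY);
     if (fd < 0 || fstat(fd, &st) != 0)
     {
         perror("open/fstat");
         return 1;
     }
     /* The mapping points at the pagecache pages themselves; once a
        page has been faulted in, reading it is a plain memory access. */
     p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
     if (p == MAP_FAILED)
     {
         perror("mmap");
         return 1;
     }
     for (i = 0; i < st.st_size; i++)
         sum += p[i];          /* no read() syscalls in this loop */
     printf("checksum %ld\n", sum);
     munmap(p, st.st_size);
     close(fd);
     return 0;
 }
 </code></pre>
This is roughly the bet PostgreSQL makes: a comparatively modest built-in cache (shared_buffers), with the OS pagecache doing much of the heavy lifting behind it.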
If we had some metrics in graph form backing up the post, it would be great for us <i>mere mortals</i> to understand which way to go in order to achieve maximal performance.
Not to detract from the very intelligent and reasoned posting, but what tiny percentage of people honestly still use fat-ass RDBMS as their primary datastore <i>and</i> would be better off performance tuning it at the kernel IO level than actually analyzing their load and subsequently sharding or migrating their data structures to less behemoth-like datastores? <i>Yes</i>, RDBMS are easy to hire developers and DBAs for, are well supported and full-featured. <i>However</i>, in this day and age using them just feels a little ... lazy ... for most workloads.