Issues like the ones raised are why all sufficiently advanced database engine designs tend to evolve toward a kernel bypass architecture. From the perspective of a database engine, operating systems do a lot of "dumb" things with resource management and scheduling that are essentially impossible to avoid, even though the database engine has enough context to handle them intelligently on its own. OS bypass in a database can have substantial performance benefits and add robustness for many edge cases. The obvious downside is that you basically end up rewriting the operating system in userspace, minus the device drivers, which is an enormous amount of work. It is not an incremental kind of design approach and it makes portability difficult. PostgreSQL actually does a really good job of being fast and robust without resorting to bypass internals.<p>I've always asserted that the reason we've never seen a proper kernel bypass database engine in open source is that the Minimum Viable Product is too complex. A bare, stripped-down, low-level database engine that does full bypass of the operating system is usually <i>at least</i> 100kLoC of low-level C/C++, and that is before you add all the features a database user will actually want. That is a big initial investment by people with fairly rare software design skills.
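The I/O half of that bypass usually starts with direct I/O. As a minimal sketch of the idea (not anyone's actual engine code; the 4K alignment and file name here are assumptions), opening with O_DIRECT on Linux skips the pagecache entirely, at the price of the application honouring the device's alignment rules itself:<p><pre><code> #define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 
 int main(void)
 {
     void *buf;
     int fd;
     /* O_DIRECT requires the buffer, file offset, and length to be
        aligned to the device's logical block size; 4K satisfies
        typical devices. */
     if (posix_memalign(&buf, 4096, 4096) != 0)
         return 1;
     memset(buf, 'X', 4096);
     fd = open("testfile.direct", O_WRONLY | O_CREAT | O_DIRECT, 0666);
     if (fd < 0)
     {
         perror("open");
         return 1;
     }
     /* The write goes straight to the device, skipping the pagecache;
        deciding what to cache becomes the engine's job, not the kernel's. */
     if (pwrite(fd, buf, 4096, 0) < 0)
         perror("pwrite");
     close(fd);
     free(buf);
     return 0;
 }
 </code></pre>
Everything the pagecache was doing for you - caching, read-ahead, write-back scheduling - now has to be reimplemented in userspace, which is exactly where the 100kLoC goes.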
I love that infinitely nerdy stuff like this can still make it to the HN homepage (there's still hope!).<p>This really needs to make it into the PG tuning guide (<a href="https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server" rel="nofollow">https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...</a>) - that's the only place where it will ultimately be seen by a worthwhile audience.
Being a user of both Linux and PostgreSQL, I'm very interested in this issue, but I only understand some of the words...<p>Could anybody wiser than me tell me whether I should be concerned, and what the possible implications of these decisions are? Should I invest in alternative platforms?
Point 2 is not actually the case, as long as your write does not partially fill a pagecache page (pagecache pages are the same size as the architecture's native page size - 4K on x86, x86-64 and ARM).<p>You can demonstrate this with a program like the following:<p><pre><code> #include <fcntl.h>
 #include <stdio.h>
 #include <unistd.h>
 
 int main(int argc, char *argv[])
 {
     size_t i;
     char pattern[512*1024];   /* a whole number of 4K pages */
     int fd;
     for (i = 0; i < sizeof pattern; i++)
         pattern[i] = 'X';
     fd = open("testfile", O_RDWR | O_CREAT | O_EXCL, 0666);
     if (fd < 0)
     {
         perror("open");
         return 1;
     }
     while (1)
     {
         /* Overwrite whole pages, flush them to disk, then drop
            them from the pagecache so the next pass starts cold. */
         pwrite(fd, pattern, sizeof pattern, 0);
         fdatasync(fd);
         posix_fadvise(fd, 0, sizeof pattern, POSIX_FADV_DONTNEED);
     }
     return 0;
 }
 </code></pre>
...then watch vmstat or iostat while it's running. Plenty of writes, no reads.<p>On the other hand, if you subtract one from the size of 'pattern', you'll see that you also get reads (as partially writing the last page requires a read-modify-write cycle).
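To see that case, the only change needed is to shorten the write by a byte (a sketch of just the modified line):<p><pre><code> /* One byte short of a whole page: the final 4K page is now
    partially written, so the kernel must read it in first. */
 pwrite(fd, pattern, sizeof pattern - 1, 0);
 </code></pre>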
I was under the impression, from reading another article, that setting the correct kernel I/O scheduler may help with a related concept: <a href="http://www.cybertec.at/postgresql-linux-kernel-io-tuning/" rel="nofollow">http://www.cybertec.at/postgresql-linux-kernel-io-tuning/</a>
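For what it's worth, a minimal sketch for checking which scheduler is active (assuming the device is sda; adjust the path for your block device - writing a scheduler name to the same file as root switches it):<p><pre><code> #include <stdio.h>
 
 int main(void)
 {
     char line[256];
     /* The kernel lists the available schedulers here and marks
        the active one in brackets, e.g. "noop [deadline] cfq". */
     FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
     if (!f)
     {
         perror("fopen");
         return 1;
     }
     if (fgets(line, sizeof line, f))
         fputs(line, stdout);
     fclose(f);
     return 0;
 }
 </code></pre>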
For 1: I don't remember exactly which machines use NUMA, but I thought it was limited to the first Opterons (and the behaviour makes sense).<p>2: Not sure; this may be specific to the FS, or something to do with the behaviour of mmapped files. However, I don't know how you guarantee that what you're writing corresponds to a single block in the FS (unless you're writing directly to /dev/sda, and even then).
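On the "single block" question, stat() can at least tell you the filesystem's preferred I/O granularity; a minimal sketch (the default path is an assumption, and note the read-modify-write behaviour above is driven by the 4K pagecache page size, not the on-disk block size):<p><pre><code> #include <stdio.h>
 #include <sys/stat.h>
 
 int main(int argc, char *argv[])
 {
     struct stat st;
     const char *path = argc > 1 ? argv[1] : "testfile";
     if (stat(path, &st) != 0)
     {
         perror("stat");
         return 1;
     }
     /* st_blksize is the filesystem's preferred I/O size; writes
        sized and aligned to a multiple of it avoid partial-block
        updates. */
     printf("%s: preferred I/O block size is %ld bytes\n",
            path, (long)st.st_blksize);
     return 0;
 }
 </code></pre>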
This is slightly OT, but I always wondered: what is the overhead of fetching cached stuff from the file system vs. having a built-in cache? Is there a way to circumvent going through the kernel and avoid a context switch?<p>In other words: could a database have only minimal built-in caching and instead rely on the OS cache?
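One way to get at this (a sketch, not a benchmark): mmap() maps the kernel's pagecache pages directly into the process, so once a page is resident, reads are plain memory accesses with no syscall or context switch; only the initial faults and evictions cross into the kernel. Assuming a non-empty file named testfile:<p><pre><code> #include <fcntl.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <unistd.h>
 
 int main(void)
 {
     struct stat st;
     long sum = 0;
     off_t i;
     char *p;
     int fd = open("testfile", O_RDONLY);
     if (fd < 0 || fstat(fd, &st) != 0)
     {
         perror("open/fstat");
         return 1;
     }
     /* The mapping points at the pagecache pages themselves; once a
        page has been faulted in, reading it is a plain memory access. */
     p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
     if (p == MAP_FAILED)
     {
         perror("mmap");
         return 1;
     }
     for (i = 0; i < st.st_size; i++)
         sum += p[i];          /* no read() syscalls in this loop */
     printf("checksum %ld\n", sum);
     munmap(p, st.st_size);
     close(fd);
     return 0;
 }
 </code></pre>
This is roughly the bet PostgreSQL makes: a comparatively modest built-in cache (shared_buffers), with the OS pagecache doing much of the heavy lifting behind it.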
If we had some metrics in graph form backing up the post, it would be great for us <i>mere mortals</i> to understand which way to go in order to achieve maximal performance.
Not to detract from the very intelligent and reasoned posting, but what tiny percentage of people honestly still use fat-ass RDBMS as their primary datastore <i>and</i> would be better off performance tuning it at the kernel IO level than actually analyzing their load and subsequently sharding or migrating their data structures to less behemoth-like datastores? <i>Yes</i>, RDBMS are easy to hire developers and DBAs for, are well supported and full-featured. <i>However</i>, in this day and age using them just feels a little ... lazy ... for most workloads.