> For each page operation, we'd need to read or write an entire 128K record, resulting in a 16x overhead for many operations.

I'm not that versed in ZFS, but isn't the default 128K recordsize a maximum? As in, ZFS can write smaller blocks based on the application's write pattern. So by setting Postgres to 32K, you're reducing the write overhead on the Postgres side, while ZFS was not the issue.

The idea of limiting ZFS to 16K or 32K records is grounded more in read performance, as ZFS will read the full record size. So by limiting ZFS to 16K or 32K, you reduce that roughly 30% read penalty you normally get.

Again, maybe I am wrong on this.

* Q: Why not use a system like YugabyteDB or CockroachDB, which seem to better meet the demands of such a large table structure?

With growth of a million records per hour, storage is going to hit limits at some point.

* Q: How do you even deal with searching 100B rows with wildcards (as on the front page)?
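
To make the amplification figures above concrete, here is a rough back-of-envelope sketch (my own assumptions, not from the article): a default 8K Postgres page size, and a recordsize that fully applies to large, already-grown data files, so every random page I/O touches a whole record. That is exactly the point in question, so treat the numbers as an upper bound rather than a measurement.

```python
# Back-of-envelope I/O amplification per 8K Postgres page,
# assuming each page operation touches a full ZFS record.
PG_PAGE_KB = 8  # default Postgres block size (assumption)

for recordsize_kb in (16, 32, 128):
    amplification = recordsize_kb / PG_PAGE_KB
    print(f"recordsize={recordsize_kb:>3}K -> {amplification:.0f}x I/O per 8K page")

# Output:
# recordsize= 16K -> 2x I/O per 8K page
# recordsize= 32K -> 4x I/O per 8K page
# recordsize=128K -> 16x I/O per 8K page  (the figure quoted above)
```

If ZFS does write smaller blocks for this workload, as suggested above, the real-world penalty would sit somewhere below these worst-case ratios.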