Wow, the title of this post is very calm compared to what is actually happening.<p>CloudSQL Postgres is running with a misconfigured OS OOM killer, which crashes the postmaster randomly even when memory use is below the instance spec. GCP closed this bug report as "Won't fix".<p>This is a priority 1 issue. Seeing a wontfix for this has completely destroyed my trust in their judgement. The bug report states that they have been in contact with support since February.<p>Unbelievable attitude towards fixing production-critical problems of their platform that affect all customers.
Are there any good/recommended books or resources for someone who wants to learn how to run PostgreSQL well? E.g., what defaults to change and when, settings for the host OS (such as in the parent linked article), overall tips/insights/recommendations.
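On the host OS side, a rough sketch of the kind of check involved, assuming a self-managed Linux host where you can read `/proc` (not possible on a managed service). The helper names `describe_overcommit` and `is_oom_exempt` are made up for illustration; the values they interpret are the standard meanings of `vm.overcommit_memory` and of a process's `oom_score_adj`. The PostgreSQL documentation suggests mode 2 and an `oom_score_adj` of -1000 for the postmaster.

```python
# Hypothetical helpers: interpret the raw contents of two Linux /proc files.
# They take the file text as input so they can be tried without a Linux host.

OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw):
    """Describe the contents of /proc/sys/vm/overcommit_memory."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, "unknown mode")


def is_oom_exempt(raw_score_adj):
    """True if a /proc/<pid>/oom_score_adj value of -1000 is set,
    which tells the OOM killer never to pick that process."""
    return int(raw_score_adj.strip()) == -1000
```

To use it on a real host, read `/proc/sys/vm/overcommit_memory` and `/proc/<postmaster pid>/oom_score_adj` and pass the text in. This only reports the settings; it says nothing about what a given cloud provider has configured.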
Are there recommendations for learning about Linux kernel memory management? Two anecdata:<p>* I had some compute servers that were up for 200 days. The customers noticed that they were half as fast as identical hardware just booted. Dropping the file system cache ("echo 3 | sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to that of the newly deployed servers. WTF? File system caches are supposed to be zero-cost discards as soon as processes ask for RAM, but something else is going on. I suspect the kernel is behaving badly with overpopulated RAM management data (TLB entries?), but I don't know how to measure that.<p>* If that is actually the problem, then a solution might be to decrease the size of that data by using non-zero hugepages ("cat /proc/sys/vm/nr_hugepages"). I'd love to see recommendations on when to use that.
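One possible starting point for measuring this, as a sketch rather than a diagnosis: compare a few `/proc/meminfo` fields between a long-running server and a freshly booted one. The function names below (`parse_meminfo`, `summarize`, `read_meminfo`) are made up for illustration; the field names are standard `/proc/meminfo` entries. This will not show TLB behaviour directly, only how much memory sits in page cache, kernel slab, and page tables.

```python
# Hypothetical helpers for comparing kernel memory bookkeeping between hosts.

FIELDS_OF_INTEREST = (
    "Cached",           # page cache
    "Slab",             # kernel slab allocator total
    "SReclaimable",     # slab memory the kernel can reclaim (e.g. dentry/inode caches)
    "SUnreclaim",       # slab memory it cannot reclaim
    "PageTables",       # memory used by page tables
    "HugePages_Total",  # number of preallocated huge pages (a count, not kB)
)


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Keep only the fields of interest, defaulting missing ones to 0."""
    return {key: fields.get(key, 0) for key in FIELDS_OF_INTEREST}


def read_meminfo(path="/proc/meminfo"):
    """Read and summarize the live file (Linux only)."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

Running `read_meminfo()` on the slow host before and after dropping caches, and on a fresh host, would at least show which of these numbers differ.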
I recently managed to crash a GCP CloudSQL Postgres 12 host by running a rather heavy interactive query (an OOM, based on the error logs).<p>It surprised me because, up until that point, I had never executed a query that caused the whole host to crash. Now I'm wondering if this misconfiguration is the cause.
Interesting. Also a problem with RDS: <a href="https://stackoverflow.com/questions/52148675/aws-rds-with-postgres-is-oom-killer-configured" rel="nofollow">https://stackoverflow.com/questions/52148675/aws-rds-with-po...</a>
I'd like to thank the author for their clear, simple explanation. I haven't had to think about allocating memory since university and am not practiced thinking about it in my software but now I feel like I have useful ways to think about why processes just disappear sometimes.
GCP CloudSQL has a lot of issues. There was one where enabling query insights caused segfaults on `LEFT JOIN` operations. It's since been patched, but really shitty.