A bunch of thoughts from a guy running Postgres on a large site:<p>A restore rate of 100Gb/hour is way too low.<p>Over a 10GigE interface with a semi-decent SSD RAID you should be pushing 1Gb/second <i>easily</i>.<p>Also, pgbouncer is never the problem. If it’s complaining, it’s either configured wrong or the database is having a bad time.<p>I’d be scared out of my mind if this was my service and I was running it all on a single master with no usable backups. Letting production traffic hit it then makes things worse, because it moves the database further and further from the state it was in when the issue occurred.<p>Those snapshots are going to be unusable unless the underlying FS is frozen before the snapshot gets taken.<p>Honestly, reading the notes (and not knowing anything about the team), they need less DevOps and more SysAdmin/DBA.
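<p>For a sense of scale, here is a rough back-of-the-envelope sketch of the two rates quoted above. The ratio comes out the same whether "Gb" is read as gigabits or gigabytes, as long as both figures use the same unit. The helper names and the 300-unit database size are made up purely for illustration; they are not figures from the incident notes.

```python
# Back-of-the-envelope comparison of the two restore rates quoted above.
# Assumption: both figures use the same unit ("Gb"), whatever it denotes.

SECONDS_PER_HOUR = 3600


def per_hour(rate_per_second: float) -> float:
    """Convert a per-second rate into a per-hour rate (same unit)."""
    return rate_per_second * SECONDS_PER_HOUR


def speedup(observed_per_hour: float, expected_per_second: float) -> float:
    """How many times faster the expected rate is than the observed one."""
    return per_hour(expected_per_second) / observed_per_hour


def restore_hours(size: float, rate_per_hour: float) -> float:
    """Hours needed to restore `size` units at `rate_per_hour` units/hour."""
    return size / rate_per_hour


observed = 100.0  # "100Gb/hour", as quoted
expected = 1.0    # "1Gb/second", as quoted

# Hypothetical database size in the same unit, for illustration only.
db_size = 300.0

print(f"expected rate: {per_hour(expected):.0f} per hour")
print(f"speedup over observed: {speedup(observed, expected):.0f}x")
print(f"restore at observed rate: {restore_hours(db_size, observed):.1f} h")
print(f"restore at expected rate: "
      f"{restore_hours(db_size, per_hour(expected)) * 60:.1f} min")
```

Under these assumptions the gap is a bit over an order of magnitude, which is the difference between a restore measured in hours and one measured in minutes.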