Lately, I've seen a lot of companies offering remote development environments and preview / ephemeral environments (disclaimer: I work for one of these companies).

However, because so many people seem to have solved this problem (each with their own twist), I've started to think that creating and deploying the environments may not be the important part. Perhaps it is the DX around the data that is supplied to these environments. I think the data can come in a few different forms:

* A brand new database with the correct schema but no data. This makes the environment usable, but it's a pain if you have to insert a lot of data before you can do anything useful. It's probably the easiest option and could be done with a container running Postgres/MySQL/etc.

* A database with the correct schema and some seed data. This fixes the scaffolding issue so the environment is immediately usable, but it doesn't allow for reproducing bugs from production. Again, this could be done with a container (a rough sketch of this approach follows below).

* A cloned production database. This allows for debugging any production bugs or issues, but without being able to scrub PII from the data it raises security concerns. It could also be done with a container, though running pg_restore into a container will make the environment's startup time a lot slower. Another approach is to use something like RDS backup snapshots to create the clones and have them ready for an environment beforehand.

* A scrubbed production clone. Same as above, but with all the PII removed. I think this is the top tier and would give developers the most benefit without the security concerns.

I'm curious what other people's thoughts are on this topic, and how (or whether) your company provides production-like data to the development process.
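To make the seed-data option concrete, here is a minimal sketch of applying a schema and seed data to a freshly started Postgres container, assuming psycopg2; the file names (schema.sql, seed.sql) and connection string are hypothetical placeholders, and the real mechanism would depend on your stack:

    # seed_dev_db.py -- minimal sketch: apply a schema and seed data to a
    # freshly started Postgres container (file names here are hypothetical).
    import os
    import psycopg2

    DSN = os.environ.get(
        "DEV_DATABASE_URL",
        "postgresql://postgres:postgres@localhost:5432/app_dev",
    )

    def apply_sql_file(conn, path):
        # Run the contents of a .sql file in a single transaction.
        with open(path) as f, conn.cursor() as cur:
            cur.execute(f.read())
        conn.commit()

    if __name__ == "__main__":
        conn = psycopg2.connect(DSN)
        try:
            apply_sql_file(conn, "schema.sql")  # tables, indexes, constraints
            apply_sql_file(conn, "seed.sql")    # a small set of representative rows
        finally:
            conn.close()

In practice this often lives in the container's own init hook (the official Postgres image runs scripts placed in /docker-entrypoint-initdb.d on first start), but the point is that the curated seed data, not the container, is the part that takes ongoing effort.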
Where I work today I have a complete copy of the database, but not complete copies of all the files the system uses. Most of the time I can debug subtle problems from the production system quickly because I have all the data, and if the problem involves recent data I can get a newer backup. When I work on features that need auxiliary files (images), I go and download those individually for a few data records.

I've worked at other places where I could do the same, and the volume of data was quite a burden, which may or may not have been a problem; for instance, when I worked on arXiv I had all the papers on my machine. Other times I had a reduced data set: when I worked on a patent search engine I needed about 10,000 documents to train a good neural network, but the full build over all the patent and non-patent documents took a day or two on a cluster.
I would lean towards a "scrubbed production clone".

Maybe have a nightly (or whatever) process that does this automatically and dumps it into the QA environment that your CI/CD tests run against.

Then if a dev wants the latest data, they just copy it from the QA environment (or a protected S3 bucket) and they are ready to roll.
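A minimal sketch of what that nightly job could look like, assuming Postgres, a production backup already restored into a scratch database to scrub in place, and hypothetical table/column names for the PII (a real job would need a reviewed list of every sensitive column):

    # nightly_scrub.py -- minimal sketch of a nightly scrub-and-publish job.
    # Assumes the production backup has already been restored into SCRUB_DSN;
    # the table and column names below are hypothetical placeholders.
    import subprocess
    import psycopg2

    SCRUB_DSN = "postgresql://scrub_user@scrub-host:5432/app_scrub"
    DUMP_PATH = "/var/backups/app_scrubbed.dump"

    # Deterministic, obviously fake replacements for PII columns.
    SCRUB_STATEMENTS = [
        "UPDATE users SET email = 'user' || id || '@example.com', "
        "full_name = 'User ' || id, phone = NULL",
        "UPDATE payment_methods SET card_last4 = '0000'",
    ]

    def scrub_and_dump():
        conn = psycopg2.connect(SCRUB_DSN)
        try:
            with conn.cursor() as cur:
                for stmt in SCRUB_STATEMENTS:
                    cur.execute(stmt)
            conn.commit()
        finally:
            conn.close()
        # Dump the scrubbed copy in custom format so devs can pg_restore it quickly.
        subprocess.run(
            ["pg_dump", "--format=custom", "--file", DUMP_PATH, "--dbname", SCRUB_DSN],
            check=True,
        )
        # From here the dump would be uploaded to the protected bucket / QA environment.

    if __name__ == "__main__":
        scrub_and_dump()

If hand-maintained UPDATE statements become unwieldy, there are purpose-built tools for this kind of masking (e.g. the PostgreSQL Anonymizer extension), but the shape of the pipeline stays the same: restore, scrub, dump, publish.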