I'm curious how various companies provide their devs with data to use for local development of their software, particularly when the database spans hundreds of tables and GBs of data.<p>Do you get a SQL dump? How often do you get updated dumps? Is the data anonymized at all? For really large databases, do you trim the tables?<p>If you don't use database dumps, what method do you use to generate data to develop against locally?
For several customer projects I periodically create a new dev database from a production snapshot on AWS RDS. Then I run an anonymizer script over it to change names, email addresses, phone numbers, and so on, so the dev data looks real but doesn’t refer to real people. For one big database the anonymizer script can also cull data, so I can set a maximum of 1,000 customer records (chosen at random) and remove all related rows. The result is small enough to dump for local development.<p>In actual use, almost all of the devs I work with connect to a dev database in the cloud rather than running locally.<p>I can’t count how many deployments I’ve seen delayed and broken because devs have different versions of tools on their laptops, or different environment settings, paths, libraries, etc. I discourage local development for that reason and push everyone to use a dev system (a cloud server) that is a known good copy of production.
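A rough sketch of what that anonymizer pass can look like, assuming PostgreSQL and a made-up schema (the table and column names here are hypothetical):

    -- Overwrite PII with derived fake values.
    UPDATE customers
    SET    name  = 'Customer ' || id,
           email = 'customer' || id || '@example.com',
           phone = '555-' || lpad((id % 10000)::text, 4, '0');

    -- Cull to 1,000 randomly chosen customers and drop their related rows.
    CREATE TEMP TABLE keep AS
        SELECT id FROM customers ORDER BY random() LIMIT 1000;
    DELETE FROM orders    WHERE customer_id NOT IN (SELECT id FROM keep);
    DELETE FROM customers WHERE id          NOT IN (SELECT id FROM keep);

Deriving the fake values from the primary key keeps them stable across runs, which helps when devs compare notes against the same dump.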
For development I have typically used <a href="https://www.liquibase.org/" rel="nofollow">https://www.liquibase.org/</a>. This is a change management tool for databases, so devs can share changes via source control and automate the database upgrade process, but it also allows adding data in SQL form. You can use the tool to extract the existing structure and data from certain tables, so you can fairly easily automate the population of reference and test data.<p>I think it is a better approach to push changes from dev into test and then into production. Data might come back the other way, but it is easier to deal with in a reviewable and selectable format, since needing big dumps of it is specific to certain tasks and by no means most tasks. Developers will have different structures locally while they work, and once given the tools to refactor databases, they do so. Every time I have seen database dumps in use, I have taken them as a smell of a broken change management system.
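For reference, Liquibase can read plain SQL files with changeset markers in comments (its "formatted SQL" changelog style); the schema below is made up:

    --liquibase formatted sql

    --changeset alice:add-order-status
    ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'new';

    --changeset alice:seed-order-statuses
    INSERT INTO order_statuses (code, label) VALUES ('new', 'New');
    INSERT INTO order_statuses (code, label) VALUES ('shipped', 'Shipped');

Each changeset is applied once and recorded, so every dev's local database converges on the same structure and seed rows straight from source control.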
I worked at a company that had all our clients' Salesforce data, emails, and internal communications in our production database. It was the most security-minded company I have worked for in 25 years. They didn't allow Windows on their network, only Macs.<p>Anyway, our staging and development databases had crap data. I asked if we could get production data in staging and they said no.
We do have the ability to do SQL DB dumps, but we usually don't, as I assume that would be pretty time consuming. If I am about to do something that could potentially screw up the dev environment DB, I just tell my manager and the sprint's lead dev beforehand and check which DBA is available to undo my changes if things start breaking.<p>Mongo dumps are much easier to do, so I do dump Mongo onto local.
In our case (Vertica + streaming big data) we don't need the full data. Most of the time we only need a few weeks of data, or maybe even just a couple of days, so the volume is minimal and we write scripts to import from PROD to DEV.<p>We don't have DEV on local laptops, though. All data stays inside a local data center.
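The idea is just a date-windowed copy. A sketch using Vertica's EXPORT TO VERTICA between clusters, run on the PROD side (cluster, credential, and table names are all hypothetical):

    -- Open a connection from the PROD cluster to the DEV cluster.
    CONNECT TO VERTICA dev_db USER dbadmin PASSWORD '***' ON 'dev-host', 5433;

    -- Push only the last two weeks of rows.
    EXPORT TO VERTICA dev_db.public.events
        AS SELECT * FROM public.events
        WHERE event_ts >= CURRENT_DATE - 14;

    DISCONNECT dev_db;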
Seven years ago, yes, but it was mostly just me. Nowadays, no. We have way too much sensitive data, and it's easy to generate good data by just using our product while developing features; we have seed data for the essential bits... Like everything, it's work to develop seed files, but it's worth it, both for the security of your customer data and for the ease of setting up solid dev environments...
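A seed file doesn't have to be fancy; a checked-in SQL script covering the rows the app can't boot without is enough (the schema here is invented):

    -- seed.sql: minimal reference data for a fresh dev database.
    INSERT INTO roles (id, name) VALUES (1, 'admin'), (2, 'member');
    INSERT INTO plans (id, code, monthly_cents) VALUES
        (1, 'free', 0),
        (2, 'pro',  2900);
    INSERT INTO feature_flags (name, enabled) VALUES ('new_dashboard', false);

Everything beyond that gets created organically by exercising the product while developing.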
As a dev I receive all forms of dumps: some already anonymized, some that I have to anonymize myself. Sometimes I have to hook up to a remote server over VPN; sometimes my IP is whitelisted against a cloud instance... It seems to vary vastly in my experience. It depends very much on the data contained in the DB, too.
We have development databases with demo data. We're expected to apply schema changes the same way they're applied in production, whether that's running a utility that applies scripts, manually running the scripts, or whatever.
Previous industry: a particular production DB had a few terabytes of data but didn't contain customer data. We could just get a dump of the prod DB; it wasn't super sensitive, so no drama. After we migrated it to AWS RDS we could just spin up new DBs from snapshots.<p>One colleague spent a few weeks pruning a DB snapshot down to something minimal-ish so that our dumb old suite of regression tests, with their uncontrolled data dependencies, could still pass. A one-off manual activity.
Depends on the data. If it includes potentially personally identifying information, it doesn't get handed to devs for testing purposes, at least not in raw form. The potential GDPR penalties are astronomical.<p>It's easy to generate fake data that approximates the characteristics of your real data set. You can also create a CI/CD test environment that allows developers to test against the database without taking possession of it. Sure, a malicious developer could use such a system to exfiltrate the database by running malicious code, but that's no different from allowing the same devs to push code to production machines.
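Generating that kind of fake data can be as simple as a set-returning query, sketched here for PostgreSQL with an invented schema (the distributions are placeholders you'd tune to match production):

    -- 100k fake customers with stable, obviously fake identities.
    INSERT INTO customers (name, email, signup_date)
    SELECT 'Customer ' || g,
           'customer' || g || '@example.com',
           CURRENT_DATE - (random() * 365)::int
    FROM generate_series(1, 100000) AS g;

    -- 1M orders with a skewed, roughly log-normal-ish amount distribution.
    INSERT INTO orders (customer_id, amount_cents, created_at)
    SELECT (random() * 99999)::int + 1,
           (exp(random() * 7))::int,
           now() - random() * interval '365 days'
    FROM generate_series(1, 1000000);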