Show HN: A tool to seed your dev database with real data

129 points by ev0xmusic about 3 years ago
A bunch of developers and myself have created RepliByte - an open-source tool to seed a development database from a production database.

Features:

- Supports data backup and restore for PostgreSQL, MySQL and MongoDB
- Replaces sensitive data with fake data
- Works on large databases (> 10GB) (read Design)
- Database subsetting: scale down a production database to a more reasonable size
- Start a local database with the prod data in a single command
- On-the-fly data (de)compression (Zlib)
- On-the-fly data de/encryption (AES-256)
- Fully stateless (no server, no daemon) and lightweight binary
- Custom transformers

My motivation: as a developer, creating a fake dataset for running tests is tedious. It also doesn't reflect real-world data and is painful to keep updated. If you prefer to run your app tests against production data, then RepliByte is for you as well.

Available for macOS, Linux and Windows.

> https://github.com/qovery/replibyte
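The dump, transform, restore flow described above can be sketched in miniature. This is a conceptual toy, not how RepliByte is implemented: it uses sqlite3 so it is self-contained, whereas RepliByte targets PostgreSQL/MySQL/MongoDB, and the table and column names are invented for illustration.

```python
# Toy version of "seed a dev database from prod, faking sensitive data
# on the way". RepliByte does this for real databases; here we use
# sqlite3 in-memory databases so the sketch runs anywhere.
import sqlite3

def seed_dev_from_prod(prod: sqlite3.Connection, dev: sqlite3.Connection) -> None:
    """Copy rows from prod to dev, replacing the sensitive column."""
    dev.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    for row_id, _email in prod.execute("SELECT id, email FROM users"):
        # "Transformer" step: swap the real email for a deterministic fake.
        dev.execute("INSERT INTO users VALUES (?, ?)",
                    (row_id, f"user{row_id}@example.com"))
    dev.commit()

prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
prod.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice@real.com"), (2, "bob@real.com")])

dev = sqlite3.connect(":memory:")
seed_dev_from_prod(prod, dev)
emails = [e for (e,) in dev.execute("SELECT email FROM users ORDER BY id")]
print(emails)  # ['user1@example.com', 'user2@example.com']
```

The dev database keeps row counts and ids (useful for realistic testing) while no real email survives the copy.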

17 comments

mdaniel about 3 years ago
Please don't require static AWS credentials: https://github.com/Qovery/replibyte/blob/v0.4.4/replibyte/src/bridge/s3.rs#L36-L37

Or at least include AWS_SESSION_TOKEN in that setup (if it is present) so that `aws sts assume-role` works, or allow `AWS_PROFILE`, or just use the aws-sdk's normal credential discovery mechanism, which on their "main" SDKs is a fallback list of providers; I couldn't follow the docs.rs soup well enough to tell whether their Rust SDK is up to speed.
krageon about 3 years ago
Unless you can exhaustively guarantee that your production data, which contains customer data, will definitely be transformed into something completely unrecognisable and irreversible (and let's face it, you never can: systems change all the time), using this is irresponsible. The fact that the motivation for it is that doing things the right way is "tedious" doesn't exactly inspire confidence, though it is definitely in the spirit of the times.
time4tea about 3 years ago
From the Thoughtworks Tech Radar (https://www.thoughtworks.com/radar), "Production data in test environments", rated Hold:

> We continue to perceive production data in test environments as an area for concern. Firstly, many examples of this have resulted in reputational damage, for example, where an incorrect alert has been sent from a test system to an entire client population. Secondly, the level of security, specifically around protection of private data, tends to be less for test systems. There is little point in having elaborate controls around access to production data if that data is copied to a test database that can be accessed by every developer and QA. Although you can obfuscate the data, this tends to be applied only to specific fields, for example, credit card numbers.

> Finally, copying production data to test systems can break privacy laws, for example, where test systems are hosted or accessed from a different country or region. This last scenario is especially problematic with complex cloud deployments. Fake data is a safer approach, and tools exist to help in its creation. We do recognize there are reasons for specific elements of production data to be copied, for example, in the reproduction of bugs or for training of specific ML models. Here our advice is to proceed with caution.
gregwebs about 3 years ago
Using a customer's production data outside of production probably violates their expectations of your data security practices. I couldn't see myself using this unless there was a mode where only allowed fields are copied and non-id fields are first transformed in a lossy way.
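The mode this commenter asks for can be sketched simply: an explicit allowlist of copyable fields, with every non-id field pushed through a lossy one-way transform. The field names here are hypothetical, and a truncated hash is just one possible lossy transform.

```python
# Allowlist + lossy transform: only named fields survive the copy, and
# everything except the id is hashed so the original value is unrecoverable.
import hashlib

ALLOWED = {"id", "plan", "country"}

def lossy(value: str) -> str:
    # Truncated SHA-256: preserves equality (rows that matched still match)
    # while destroying the original value.
    return hashlib.sha256(value.encode()).hexdigest()[:8]

def scrub(row: dict) -> dict:
    return {k: (v if k == "id" else lossy(str(v)))
            for k, v in row.items() if k in ALLOWED}

row = {"id": 42, "plan": "pro", "country": "DE", "email": "alice@real.com"}
print(scrub(row))  # email is dropped entirely; plan/country are hashed
```

Keeping equality but not values means joins and group-bys still behave realistically in dev, which is often all that "real" data was wanted for.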
jlgaddis about 3 years ago
I'm (not) looking forward to the future data breach notifications / post-mortems that include something like "... our developers used a tool to copy the production database to a dev database on their laptop ..."

Honestly, I'm kinda surprised by the lack of comments advocating against doing this.
claytongulick about 3 years ago
The comments I've read on here seem strangely negative; I don't understand why. I think this tool looks great!

I appreciate the time and effort you put into releasing a free and open-source tool to help solve a real problem. Keep up the great work!
tjpnz about 3 years ago
I like this, but after a cursory glance at the source I have a few concerns:

- There's a transformer which appears to retain the first char of string fields. That's not safe if you're dealing with customer data.

- Remove the telemetry. That it's claimed to be anonymized and togglable is meaningless where sensitive data is concerned.
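The first concern is easy to demonstrate: keeping just the first character (and, implicitly, the string length) can shrink the anonymity set to a single person. A toy sketch with an invented customer list:

```python
# Why a "keep the first char" transformer leaks: with any modest candidate
# list, first character plus length can pin down the original value.
customers = ["Smith", "Jones", "Taylor", "Brown", "Wilson"]

def first_char_transform(s: str) -> str:
    return s[0] + "*" * (len(s) - 1)

masked = first_char_transform("Taylor")  # 'T*****'
candidates = [c for c in customers
              if c[0] == masked[0] and len(c) == len(masked)]
print(candidates)  # ['Taylor'] -- fully re-identified
```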
nicoburns about 3 years ago
This does sound kind of useful. On the other hand, I performed a similar task just yesterday using the native pg_dump and pg_restore commands, and it only took a couple of hours to set up (and now I have a repeatable script), so this will need to be implemented really well to provide value.
micheljansen about 3 years ago
This is actually a much harder problem than it seems. GDPR is quite strict about what is considered PII (and rightly so). For example, you may think replacing sensitive data with fake data is enough to anonymise customer data. It's not:

> "Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data."

So it's not enough to, for example, replace all names, addresses etc. when you can still see which products someone has interacted with, when their account was created (which in the production DB would relate back to their actual account!) or any other unexpected pieces of information that link back to their identity.

In practice, this means that any realistic production-derived data is either very likely to still be considered PII (and therefore much more demanding to handle safely and securely) or has to be mangled so much that it is no longer representative of production data.
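The "pieces of information collected together" point is the classic linkage attack, and it fits in a few lines. All data here is invented; the quasi-identifiers (zip code plus birth year) stand in for any combination of innocuous-looking fields:

```python
# Toy linkage attack: names were stripped from the shared dataset, but the
# remaining quasi-identifiers still re-identify everyone via a join against
# any auxiliary dataset an attacker can find.
anonymized = [  # the "safe" copy handed to developers
    {"zip": "10115", "birth_year": 1984, "purchases": ["x"]},
    {"zip": "80331", "birth_year": 1990, "purchases": ["y"]},
]
public = [  # e.g. a voter roll, a leaked list, a social profile dump
    {"name": "Alice", "zip": "10115", "birth_year": 1984},
    {"name": "Bob", "zip": "80331", "birth_year": 1990},
]

lookup = {(p["zip"], p["birth_year"]): p["name"] for p in public}
leaks = [(lookup[(r["zip"], r["birth_year"])], r["purchases"])
         for r in anonymized]
print(leaks)  # [('Alice', ['x']), ('Bob', ['y'])]
```

Dropping the name column anonymized nothing: the purchase histories are re-attached to named individuals with a single dictionary join.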
dvasdekis about 3 years ago
I was thinking "oh! this is awesome!", and then noticed it didn't support MSSQL. Not to worry, I'll just contribute a connector. Let's take a look at their existing connector code:

https://github.com/Qovery/replibyte/blob/main/replibyte/src/source/postgres.rs

Not a single comment to say what anything does. Sigh. It's the same for the other drivers too.
husainfazel about 3 years ago
> - Works on large databases (> 10GB) (read Design)

Can anyone explain to me how this works in RepliByte? The design document only talks about Postgres. For example, let's say I have a MySQL database: how does RepliByte copy that database into S3? Does it use mysqldump, or is it copying the database index files?

We have a script that automatically backs up our production database at intervals to S3, and then a program to download the latest backup and scrub the data. It takes a heck of a long time to download and impacts the server when it happens... it's been on my todo list to replace it with Percona's XtraBackup [1], but it doesn't look like that's what these guys are doing.

> - Database Subsetting: Scale down a production database to a more reasonable size

What about this? Does the database need foreign keys to prevent related rows in tables being lost, or are they just randomly deleting rows, as the config seems to indicate [2]?

[1] https://www.percona.com/software/mysql-database/percona-xtrabackup

[2] https://github.com/qovery/replibyte#configuration
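The foreign-key concern is the crux of subsetting. How RepliByte handles it isn't answered in the thread, but the general technique is to sample parent rows and then keep only the child rows that reference the sample, so no orphans survive. A self-contained sqlite3 sketch with invented tables:

```python
# FK-aware subsetting sketch: sample the parent table, then restrict every
# child table to rows referencing the sampled parents. Naive random row
# deletion would instead leave orphaned child rows behind.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id));
    INSERT INTO users VALUES (1), (2), (3), (4);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3), (13, 4);
""")

keep = {1, 3}  # e.g. a random 50% sample of users
marks = ",".join("?" * len(keep))
db.execute(f"DELETE FROM users WHERE id NOT IN ({marks})", tuple(keep))
db.execute(f"DELETE FROM orders WHERE user_id NOT IN ({marks})", tuple(keep))

orphans = db.execute("""SELECT COUNT(*) FROM orders o
                        LEFT JOIN users u ON u.id = o.user_id
                        WHERE u.id IS NULL""").fetchone()[0]
print(orphans)  # 0 -- referential integrity preserved after the cut
```

With deeper schemas this becomes a graph traversal from the sampled roots, which is exactly why subsetting is harder than it looks when foreign keys are only implied by convention rather than declared.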
bprasanna about 3 years ago
One of my colleagues has developed a sophisticated data generator addressing the needs of workloads/algorithms that depend on the characteristics of the data:

https://github.com/jssprasanna/redgene

ReDGene (Relational Data Generator) is a tool aimed at taking control over data generation: it can generate column vectors in a table with a required type, interval, length, cardinality, skewness, and constraints such as Primary Key, Foreign Key, Foreign Key (unique 1:1 mapping), Composite Primary Key and Composite Foreign Key.

It is also DB agnostic: it generates data as flat files, which can be imported into any database that supports importing data from flat files.
exdsq about 3 years ago
You'd imagine Postgres or whatever would have a built-in function to populate a DB based on types, as a sort of fuzzing tool, tbh.

I worked on a gov app years ago that required anonymized databases and I remember thinking that then: why isn't it available out of the box? Everyone must need this from time to time.
onion2k about 3 years ago
This project needs a giant heading box in the README stating three things:

- Staging databases that hold data generated from production databases should be considered production data, with the same level of consideration for security and access as production.

- Staging databases that hold production data are a GDPR violation waiting to happen. Make sure your data controller / lawyers know exactly what you're doing with production data.

- Ask yourself why you need production data in staging in the first place. What are you gaining over a script that generates data? If you want data at scale, you can generate it randomly. If you want data that covers all edge cases, you can generate it non-randomly. If you want "real-looking" data, then maybe this tool is useful.

People copying data from production to staging and then failing to look after it properly is a nightmare. It shouldn't be encouraged except in very unusual circumstances. In my experience of dev, your development and staging data should cover the weird edge cases that you need to handle far more than the nice "happy path" data you get in production.
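The "generate it non-randomly" point can be made concrete with a tiny generator that guarantees the awkward edge cases appear, which a sample of happy-path production data often wouldn't contain. The field names and edge values here are illustrative:

```python
# Edge-case-first data generation: random bulk, but every tricky value
# (empty string, apostrophe, non-ASCII, max length, whitespace, NULL) is
# guaranteed to appear at least once in the output.
import random

EDGE_NAMES = ["", "O'Brien", "名前", "a" * 255, " leading space", None]

def make_users(n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded, so test data is reproducible
    rows = [{"id": i, "name": rng.choice(EDGE_NAMES)} for i in range(n)]
    # Guarantee each edge case appears at least once (requires n >= len(EDGE_NAMES)).
    for i, name in enumerate(EDGE_NAMES):
        rows[i]["name"] = name
    return rows

users = make_users(20)
print(len(users))  # 20
```

A script like this is version-controlled, repeatable, and carries none of the GDPR baggage of a production copy.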
xwowsersx about 3 years ago
This looks very useful and I have an immediate need for it. I'm still not sure how to use replibyte to do the following: I want to take a snapshot of the DB from one of my environments and then seed a local DB with it. I see this is a basic use case of replibyte, but I'm not sure exactly how to accomplish it.

I have a docker container running postgres and I just want to take the snapshot and seed it into that. How exactly do I do this?
a_c about 3 years ago
Am I missing the obvious? Why would one seed a dev database from production? If anything, data on dev should exist before production.
fedeb95 about 3 years ago
Interesting. However, couldn't it detect tables and columns automatically instead of having to specify them in the configuration file? If I understand correctly, each table has to be specified by hand. Say I have nearly a hundred tables...
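Auto-detection is plausible because relational databases expose their own schema. A sqlite3 sketch of the idea; against Postgres or MySQL one would query information_schema instead, and whether RepliByte could adopt this is an open question:

```python
# Schema introspection: discover every table and its columns from the
# database itself, rather than hand-listing them in a config file.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER, email TEXT);
    CREATE TABLE orders (id INTEGER, user_id INTEGER, total REAL);
""")

schema = {}
for (table,) in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    # PRAGMA table_info returns (cid, name, type, notnull, default, pk).
    cols = [row[1] for row in db.execute(f"PRAGMA table_info({table})")]
    schema[table] = cols

print(schema)
```

A tool could walk this map and only require configuration for the columns that need transforming, rather than for every table.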