Show HN: A tool to seed your dev database with real data

129 points by ev0xmusic about 3 years ago
A bunch of developers and myself have created RepliByte - an open-source tool to seed a development database from a production database.

Features:

- Supports data backup and restore for PostgreSQL, MySQL and MongoDB
- Replaces sensitive data with fake data
- Works on large databases (> 10GB) (read Design)
- Database subsetting: scale down a production database to a more reasonable size
- Start a local database with the prod data in a single command
- On-the-fly data (de)compression (Zlib)
- On-the-fly data de/encryption (AES-256)
- Fully stateless (no server, no daemon) and lightweight binary
- Custom transformers

My motivation: as a developer, creating a fake dataset for running tests is tedious. It also doesn't reflect real-world data and is painful to keep updated. If you prefer to run your app tests against production data, then RepliByte is for you as well.

Available for macOS, Linux and Windows.

> https://github.com/qovery/replibyte
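The dump, transform, restore flow described above can be sketched in miniature. This is a conceptual toy, not how RepliByte is implemented: it uses sqlite3 so it is self-contained, whereas RepliByte targets PostgreSQL/MySQL/MongoDB, and the table and column names are invented for illustration.

```python
# Toy version of "seed a dev database from prod, faking sensitive data
# on the way". RepliByte does this for real databases; here we use
# sqlite3 in-memory databases so the sketch runs anywhere.
import sqlite3

def seed_dev_from_prod(prod: sqlite3.Connection, dev: sqlite3.Connection) -> None:
    """Copy rows from prod to dev, replacing the sensitive column."""
    dev.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    for row_id, _email in prod.execute("SELECT id, email FROM users"):
        # "Transformer" step: swap the real email for a deterministic fake.
        dev.execute("INSERT INTO users VALUES (?, ?)",
                    (row_id, f"user{row_id}@example.com"))
    dev.commit()

prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
prod.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice@real.com"), (2, "bob@real.com")])

dev = sqlite3.connect(":memory:")
seed_dev_from_prod(prod, dev)
emails = [e for (e,) in dev.execute("SELECT email FROM users ORDER BY id")]
print(emails)  # ['user1@example.com', 'user2@example.com']
```

The dev database keeps row counts and ids (useful for realistic testing) while no real email survives the copy.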

17 comments

mdaniel about 3 years ago
Please don't require static AWS credentials: https://github.com/Qovery/replibyte/blob/v0.4.4/replibyte/src/bridge/s3.rs#L36-L37

Or at least include AWS_SESSION_TOKEN in that setup (if it is present) so that `aws sts assume-role` works, or allow `AWS_PROFILE`, or just use the aws-sdk's normal credential discovery mechanism, which on their "main" SDKs is a fallback list of providers; I couldn't follow the docs.rs soup well enough to tell whether their Rust SDK is up to speed.
krageon about 3 years ago
Unless you can exhaustively guarantee that your production data, which contains customer data, will definitely be transformed into something completely unrecognisable and irreversible (and let's face it, you never can: systems change all the time), using this is irresponsible. The fact that the motivation for it is that doing things the right way is "tedious" doesn't exactly inspire confidence, though it is definitely in the spirit of the times.
time4tea about 3 years ago
From the Thoughtworks Tech Radar (https://www.thoughtworks.com/radar), "Production data in test environments", rated Hold:

> We continue to perceive production data in test environments as an area for concern. Firstly, many examples of this have resulted in reputational damage, for example, where an incorrect alert has been sent from a test system to an entire client population. Secondly, the level of security, specifically around protection of private data, tends to be less for test systems. There is little point in having elaborate controls around access to production data if that data is copied to a test database that can be accessed by every developer and QA. Although you can obfuscate the data, this tends to be applied only to specific fields, for example, credit card numbers.

> Finally, copying production data to test systems can break privacy laws, for example, where test systems are hosted or accessed from a different country or region. This last scenario is especially problematic with complex cloud deployments. Fake data is a safer approach, and tools exist to help in its creation. We do recognize there are reasons for specific elements of production data to be copied, for example, in the reproduction of bugs or for training of specific ML models. Here our advice is to proceed with caution.
gregwebs about 3 years ago
Using a customer's production data outside of production probably violates their expectations of your data security practices. I couldn't see myself using this unless there was a mode where only allowed fields are copied and non-id fields are first transformed in a lossy way.
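The mode this commenter asks for can be sketched simply: an explicit allowlist of copyable fields, with every non-id field pushed through a lossy one-way transform. The field names here are hypothetical, and a truncated hash is just one possible lossy transform.

```python
# Allowlist + lossy transform: only named fields survive the copy, and
# everything except the id is hashed so the original value is unrecoverable.
import hashlib

ALLOWED = {"id", "plan", "country"}

def lossy(value: str) -> str:
    # Truncated SHA-256: preserves equality (rows that matched still match)
    # while destroying the original value.
    return hashlib.sha256(value.encode()).hexdigest()[:8]

def scrub(row: dict) -> dict:
    return {k: (v if k == "id" else lossy(str(v)))
            for k, v in row.items() if k in ALLOWED}

row = {"id": 42, "plan": "pro", "country": "DE", "email": "alice@real.com"}
print(scrub(row))  # email is dropped entirely; plan/country are hashed
```

Keeping equality but not values means joins and group-bys still behave realistically in dev, which is often all that "real" data was wanted for.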
jlgaddis about 3 years ago
I'm (not) looking forward to the future data breach notifications / post-mortems that include something like "... our developers used a tool to copy the production database to a dev database on their laptop ..."

Honestly, I'm kinda surprised by the lack of comments advocating against doing this.
claytongulick about 3 years ago
The comments I've read on here seem strangely negative; I don't understand why. I think this tool looks great!

I appreciate the time and effort you put into releasing a free and open-source tool to help solve a real problem. Keep up the great work!
tjpnz about 3 years ago
I like this, but after a cursory glance at the source I have a few concerns:

- There's a transformer which appears to retain the first char of string fields. That's not safe if you're dealing with customer data.

- Remove the telemetry. That it's claimed to be anonymized and togglable is meaningless where sensitive data is concerned.
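The first concern is easy to demonstrate: keeping just the first character (and, implicitly, the string length) can shrink the anonymity set to a single person. A toy sketch with an invented customer list:

```python
# Why a "keep the first char" transformer leaks: with any modest candidate
# list, first character plus length can pin down the original value.
customers = ["Smith", "Jones", "Taylor", "Brown", "Wilson"]

def first_char_transform(s: str) -> str:
    return s[0] + "*" * (len(s) - 1)

masked = first_char_transform("Taylor")  # 'T*****'
candidates = [c for c in customers
              if c[0] == masked[0] and len(c) == len(masked)]
print(candidates)  # ['Taylor'] -- fully re-identified
```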
nicoburns about 3 years ago
This does sound kind of useful. On the other hand, I performed a similar task just yesterday using the native pg_dump and pg_restore commands, and it only took a couple of hours to set up (and now I have a repeatable script), so this will need to be implemented really well to provide value.
micheljansen about 3 years ago
This is actually a much harder problem than it seems. GDPR is quite strict about what is considered PII (and rightly so). For example, you may think replacing sensitive data with fake data is enough to anonymise customer data. It's not:

> "Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data."

So it's not enough to, for example, replace all names, addresses etc. when you can still see which products someone has interacted with, when their account was created (which in the production DB would relate back to their actual account!) or any other unexpected pieces of information that link back to their identity.

In practice, this means that any realistic production-derived data is either very likely to still be considered PII (and therefore much more demanding to handle safely and securely) or has to be mangled so much that it is no longer representative of production data.
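The "pieces of information collected together" point is the classic linkage attack, and it fits in a few lines. All data here is invented; the quasi-identifiers (zip code plus birth year) stand in for any combination of innocuous-looking fields:

```python
# Toy linkage attack: names were stripped from the shared dataset, but the
# remaining quasi-identifiers still re-identify everyone via a join against
# any auxiliary dataset an attacker can find.
anonymized = [  # the "safe" copy handed to developers
    {"zip": "10115", "birth_year": 1984, "purchases": ["x"]},
    {"zip": "80331", "birth_year": 1990, "purchases": ["y"]},
]
public = [  # e.g. a voter roll, a leaked list, a social profile dump
    {"name": "Alice", "zip": "10115", "birth_year": 1984},
    {"name": "Bob", "zip": "80331", "birth_year": 1990},
]

lookup = {(p["zip"], p["birth_year"]): p["name"] for p in public}
leaks = [(lookup[(r["zip"], r["birth_year"])], r["purchases"])
         for r in anonymized]
print(leaks)  # [('Alice', ['x']), ('Bob', ['y'])]
```

Dropping the name column anonymized nothing: the purchase histories are re-attached to named individuals with a single dictionary join.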
dvasdekis about 3 years ago
I was thinking "oh! this is awesome!", and then noticed it didn't support MSSQL. Not to worry, I'll just contribute a connector. Let's take a look at their existing connector code:

https://github.com/Qovery/replibyte/blob/main/replibyte/src/source/postgres.rs

Not a single comment to say what anything does. Sigh. It's the same for the other drivers too.
husainfazel about 3 years ago
> - Works on large databases (> 10GB) (read Design)

Can anyone explain to me how this works in RepliByte? The design document only talks about Postgres. For example, let's say I have a MySQL database: how does RepliByte copy that database into S3? Does it use mysqldump, or is it copying the database index files?

We have a script that automatically backs up our production database at intervals to S3, and then a program to download the latest backup and scrub the data. It takes a heck of a long time to download and impacts the server when it happens... it's been on my todo list to replace it with Percona's XtraBackup [1], but it doesn't look like that's what these guys are doing.

> - Database Subsetting: Scale down a production database to a more reasonable size

What about this? Does the database need foreign keys to prevent related rows in tables being lost, or are they just randomly deleting rows, as the config seems to indicate [2]?

[1] https://www.percona.com/software/mysql-database/percona-xtrabackup

[2] https://github.com/qovery/replibyte#configuration
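The foreign-key concern is the crux of subsetting. How RepliByte handles it isn't answered in the thread, but the general technique is to sample parent rows and then keep only the child rows that reference the sample, so no orphans survive. A self-contained sqlite3 sketch with invented tables:

```python
# FK-aware subsetting sketch: sample the parent table, then restrict every
# child table to rows referencing the sampled parents. Naive random row
# deletion would instead leave orphaned child rows behind.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id));
    INSERT INTO users VALUES (1), (2), (3), (4);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3), (13, 4);
""")

keep = {1, 3}  # e.g. a random 50% sample of users
marks = ",".join("?" * len(keep))
db.execute(f"DELETE FROM users WHERE id NOT IN ({marks})", tuple(keep))
db.execute(f"DELETE FROM orders WHERE user_id NOT IN ({marks})", tuple(keep))

orphans = db.execute("""SELECT COUNT(*) FROM orders o
                        LEFT JOIN users u ON u.id = o.user_id
                        WHERE u.id IS NULL""").fetchone()[0]
print(orphans)  # 0 -- referential integrity preserved after the cut
```

With deeper schemas this becomes a graph traversal from the sampled roots, which is exactly why subsetting is harder than it looks when foreign keys are only implied by convention rather than declared.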
bprasanna about 3 years ago
One of my colleagues has developed a sophisticated data generator addressing the needs of workloads/algorithms that depend on the characteristics of the data:

https://github.com/jssprasanna/redgene

ReDGene (Relational Data Generator) is a tool aimed at taking control over data generation: it can generate column vectors in a table with a required type, interval, length, cardinality, skewness, and constraints such as Primary Key, Foreign Key, Foreign Key (unique 1:1 mapping), Composite Primary Key and Composite Foreign Key.

It is also DB agnostic: it generates data as flat files, which can be imported into any database that supports importing data from flat files.
exdsq about 3 years ago
You'd imagine Postgres or whatever would have a built-in function to populate a DB based on types, as a sort of fuzzing tool, tbh.

I worked on a gov app years ago that required anonymized databases and I remember thinking that then: why isn't it available out of the box? Everyone must need this from time to time.
onion2k about 3 years ago
This project needs a giant heading box in the README stating three things:

- Staging databases that hold data generated from production databases should be considered production data, with the same level of consideration for security and access as production.

- Staging databases that hold production data are a GDPR violation waiting to happen. Make sure your data controller / lawyers know exactly what you're doing with production data.

- Ask yourself why you need production data in staging in the first place. What are you gaining over a script that generates data? If you want data at scale, you can generate it randomly. If you want data that covers all edge cases, you can generate it non-randomly. If you want "real-looking" data, then maybe this tool is useful.

People copying data from production to staging and then failing to look after it properly is a nightmare. It shouldn't be encouraged except in very unusual circumstances. In my experience of dev, your development and staging data should cover the weird edge cases that you need to handle far more than the nice "happy path" data you get in production.
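The "generate it non-randomly" point can be made concrete with a tiny generator that guarantees the awkward edge cases appear, which a sample of happy-path production data often wouldn't contain. The field names and edge values here are illustrative:

```python
# Edge-case-first data generation: random bulk, but every tricky value
# (empty string, apostrophe, non-ASCII, max length, whitespace, NULL) is
# guaranteed to appear at least once in the output.
import random

EDGE_NAMES = ["", "O'Brien", "名前", "a" * 255, " leading space", None]

def make_users(n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded, so test data is reproducible
    rows = [{"id": i, "name": rng.choice(EDGE_NAMES)} for i in range(n)]
    # Guarantee each edge case appears at least once (requires n >= len(EDGE_NAMES)).
    for i, name in enumerate(EDGE_NAMES):
        rows[i]["name"] = name
    return rows

users = make_users(20)
print(len(users))  # 20
```

A script like this is version-controlled, repeatable, and carries none of the GDPR baggage of a production copy.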
xwowsersx about 3 years ago
This looks very useful and I have an immediate need for it. I'm still not sure how to use replibyte to do the following: I want to take a snapshot of the DB from one of my environments and then seed a local DB with it. I see this is a basic use case of replibyte, but I'm not sure exactly how to accomplish it.

I have a docker container running postgres and I just want to take the snapshot and seed it into that. How exactly do I do this?
a_c about 3 years ago
Am I missing the obvious? Why would one seed a dev database from production? If anything, data on dev should exist before production.
fedeb95 about 3 years ago
Interesting. However, couldn't it detect tables and columns automatically instead of having to specify them in the configuration file? If I understand correctly, each table has to be specified by hand. Say I have nearly a hundred tables...
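Auto-detection is plausible because relational databases expose their own schema. A sqlite3 sketch of the idea; against Postgres or MySQL one would query information_schema instead, and whether RepliByte could adopt this is an open question:

```python
# Schema introspection: discover every table and its columns from the
# database itself, rather than hand-listing them in a config file.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER, email TEXT);
    CREATE TABLE orders (id INTEGER, user_id INTEGER, total REAL);
""")

schema = {}
for (table,) in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    # PRAGMA table_info returns (cid, name, type, notnull, default, pk).
    cols = [row[1] for row in db.execute(f"PRAGMA table_info({table})")]
    schema[table] = cols

print(schema)
```

A tool could walk this map and only require configuration for the columns that need transforming, rather than for every table.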