
Show HN: Neosync – Open-Source Data Anonymization for Postgres and MySQL

246 points · by edrenova · 12 months ago
Hey HN, we're Evis and Nick and we're excited to be launching Neosync (https://www.github.com/nucleuscloud/neosync). Neosync is an open source platform that helps developers anonymize production data, generate synthetic data and sync it across their environments for better testing, debugging and developer experience.

Most developers and teams have some version of a database seed script that creates mock data for their local and stage databases. The problem is that production data is messy and very difficult to replicate with mock data. This causes two big problems for developers.

The first problem is that features seem to work locally and in stage but have bugs and edge cases in production, because the seed data you developed against was not representative of production data.

The second problem is that debugging production errors takes a long time, and the errors often resurface. When we see a bug in production, the first thing we want to do is reproduce it locally, but if we can't reproduce the state of the data locally, we're flying blind.

Working directly with production data would solve both of these problems, but most teams can't because of (1) privacy/security issues and (2) scale. So we set out to solve these two problems with Neosync.

We solve the privacy and security problem with anonymization and synthetic data. We have 40+ pre-built transformers (or you can write your own in code) that can anonymize PII or sensitive data so that it's safe to use locally. Additionally, you can generate synthetic data from scratch that fits your existing schema.

The second problem is scale. Some production databases are too big to fit locally or simply have more data than you need. In some cases you may want to debug a particular customer's data and only want their rows. We solve this with subsetting: you pass in a SQL query to filter your table(s), and Neosync handles all of the heavy lifting, including referential integrity.

At its core, Neosync does three things: (1) it streams data from a source to one or more destination databases (we never store your sensitive data); (2) while the data is being streamed, it transforms it: you define which schemas and tables you want to sync and, at the column level, select a transformer that defines how to anonymize the data or generate synthetic data; (3) it subsets your data based on your filters.

We do all of this while handling referential integrity. Whether you have primary keys, foreign keys, unique constraints, circular dependencies (within a table and across tables), sequences and more, Neosync preserves those references.

We also ship APIs, a Terraform provider, a CLI and a GitHub Action that you can use to hydrate a CI database.

Neosync is an open source project written in Go and TypeScript and can be run on Docker Compose, bare metal, or Kubernetes via Helm. There is also a hosted platform with a generous free tier (https://neosync.dev), as well as a managed platform you can deploy in your VPC.

Here's a brief Loom demo: https://www.loom.com/share/ac21378d01cd4d848cf723e4960e8338?sid=2faf613c-92be-44fa-9278-c8087e777356

We'd love any feedback you have!
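To make the referential-integrity point concrete, here is a generic Python sketch (not Neosync's actual transformer API; the secret and table shapes are hypothetical): anonymizing a key deterministically, e.g. with a keyed hash, maps the same input to the same token everywhere it appears, so joins keep working after the transform.

```python
import hashlib
import hmac

SECRET = b"rotate-me-per-environment"  # hypothetical per-environment secret

def pseudonymize(value: str) -> str:
    """Map a sensitive value to a stable, irreversible token."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

users = [{"id": "u1", "email": "alice@example.com"}]
orders = [{"user_id": "u1", "total": 42}]

# Deterministic mapping: the same input always yields the same token,
# so the users <-> orders foreign key survives anonymization.
anon_users = [{**u, "id": pseudonymize(u["id"]),
               "email": pseudonymize(u["email"])} for u in users]
anon_orders = [{**o, "user_id": pseudonymize(o["user_id"])} for o in orders]
assert anon_orders[0]["user_id"] == anon_users[0]["id"]
```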

10 comments

gregwebs · 12 months ago
Great to see such a project. We are using datanymizer [1] right now, but it has gone unmaintained, so we are running my patched version [2], which is working pretty well for us. I also saw a newer project that is getting close to the feature set I need and has the rest on its roadmap [3].

To ensure that we are marking columns as PII, we run a job that compares the anonymization configuration to a comment on the column; we have a comment on every column marking it as PII (or not).

[1] https://github.com/datanymizer/datanymizer
[2] https://github.com/digitalmint/datanymizer/tree/digitalmint
[3] https://github.com/GreenmaskIO/greenmask

Other tools I found that do some anonymization but didn't meet my needs:

* https://github.com/DivanteLtd/anonymizer
* https://postgresql-anonymizer.readthedocs.io/en/stable
* https://nitzano.github.io/dbzar/
* https://github.com/Qovery/Replibyte
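The column-comment audit described above lends itself to a small script. A hedged sketch (the config shape and DSN are hypothetical; the catalog query uses the standard Postgres col_description() function):

```python
import psycopg2

# Hypothetical: the (table, column) pairs the anonymization config covers.
ANON_CONFIG = {("users", "email"), ("users", "full_name")}

# col_description() returns the comment on a column, if any.
QUERY = """
SELECT c.table_name, c.column_name,
       col_description(format('%I.%I', c.table_schema, c.table_name)::regclass,
                       c.ordinal_position)
FROM information_schema.columns c
WHERE c.table_schema = 'public';
"""

def audit(dsn: str) -> list[tuple[str, str]]:
    """Return PII-commented columns that have no anonymization rule."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return [(t, col) for t, col, comment in cur.fetchall()
                if comment and "PII" in comment and (t, col) not in ANON_CONFIG]

if __name__ == "__main__":
    unmasked = audit("dbname=app")
    assert not unmasked, f"PII columns missing an anonymization rule: {unmasked}"
```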
blopker · 12 months ago
I don't know exactly how this works, but I wanted to share my experience trying to anonymize data: don't.

While you may be able to change or delete obvious PII, like names, every bit of real data in aggregate leads to revealing someone's identity. They are male? That's half the population. They also live in Seattle, are Hispanic, age 18-25? Down to a few hundred thousand. They use Firefox? That might be like 10 people.

This is why browser fingerprinting is so effective. It's how ad targeting works.

Just stick with fuzzing random data during development. Many web frameworks already have libraries for doing this. Django, for example, has factory_boy [0]. You just tell it what model to use, and the factory class will generate data based on your schema. You'll catch more issues this way anyway, because computers are better at making nonsensical data.

Keep production data in production.

[0]: https://factoryboy.readthedocs.io/en/stable/orms.html
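For concreteness, a minimal factory_boy sketch of the kind described above (the Django model and its fields are hypothetical):

```python
import factory
from myapp.models import User  # hypothetical Django model

class UserFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = User

    username = factory.Sequence(lambda n: f"user{n}")
    email = factory.Faker("email")  # plausible but entirely fake
    city = factory.Faker("city")

# UserFactory.create_batch(50) seeds a dev database with fake rows,
# with no production data involved.
```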
imiric · 12 months ago
Congrats on the launch!

This topic is relevant to what I'm currently working on, and I'm finding it exhausting, to be honest. After considering several options for anonymizing both Postgres and ClickHouse data, I've been evaluating clickhouse-obfuscator [1] for a few weeks now. The idea in principle is great: ClickHouse allows you to export both its own data and Postgres data (via the named collections feature) into Parquet format (or a bunch of others, but we settled on Parquet), and clickhouse-obfuscator can then anonymize the data and store it as Parquet as well, which can be imported wherever needed.

The problem I'm running into is referential integrity: importing the anonymized data raises unique and foreign key violations. The obfuscator tool is pretty minimal and has few knobs to tweak its output, so it's difficult to work around this, and I'm considering other options at this point.

Your tool looks interesting, and it seems that you directly address the referential integrity issue, which is great.

I have a couple of questions:

1. Does Neosync ensure that anonymized data has the same statistical properties (distribution, cardinality, etc.) as the source data? This is something clickhouse-obfuscator put quite a lot of effort into addressing, as you can see from its README. Generating synthetic data doesn't solve this, and some anonymization tools aren't this sophisticated either.

2. How does it differ from existing PG anonymization solutions, such as PostgreSQL Anonymizer [2]? Obviously you also handle MySQL, but I'm interested in PG specifically.

As a side note, I'm not sure I understand the value proposition of your Cloud service. If the original data needs to be exported and sent to your Cloud for anonymization, it defeats the entire purpose of this process and only adds risk. I don't think most companies looking for a solution like this would choose to rely on an external service. Thanks for releasing it as open source, but I can't say I trust your business model to sustain a company around this product.

[1]: https://github.com/ClickHouse/ClickHouse/blob/master/programs/obfuscator/README.md
[2]: https://postgresql-anonymizer.readthedocs.io/en/stable/
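One way to catch the violations described above before re-importing is a quick integrity check over the exported files. A pandas sketch, separate from clickhouse-obfuscator itself (file and column names are hypothetical):

```python
import pandas as pd

users = pd.read_parquet("users_obfuscated.parquet")
orders = pd.read_parquet("orders_obfuscated.parquet")

# Unique-key check: obfuscation must not collide primary keys.
dup_keys = users["id"][users["id"].duplicated()]

# Foreign-key check: every order must still point at an existing user.
orphans = orders[~orders["user_id"].isin(users["id"])]

print(f"{len(dup_keys)} duplicated user ids, {len(orphans)} orphaned orders")
```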
mathisd · 12 months ago
During an internship, I was part of a team that developed a collection of tools [0] intended to provide pseudonymization of production databases for testing and development purposes. These tools were developed while being used in parallel with clients that had a large number of databases.

Referential constraints refer to ensuring some coherence / basic logic in the output data (i.e. the anonymized street name must exist in the anonymized city). This was the most time-consuming phase of the pseudonymization process. They were working on introducing pseudonymization with cross-referential constraints, which is a mess, as constraints were often strongly intertwined. Also, a lot of the time the client had no proper idea of what the fields were and what they truly contained (what format of phone number, for instance; we found a lot of unusual things).

[0] (LINO, PIMO, SIGO, etc.) https://github.com/CGI-FR/PIMO
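The street-in-city constraint mentioned above can be illustrated with a tiny sketch (generic Python, not PIMO's configuration syntax; the lookup table is hypothetical): sample the city first, then draw the street from a per-city pool, so the output stays coherent.

```python
import random

# Hypothetical lookup: streets that really exist in each city.
STREETS_BY_CITY = {
    "Lyon": ["Rue de la République", "Quai Saint-Antoine"],
    "Nantes": ["Rue Crébillon", "Quai de la Fosse"],
}

def pseudonymize_address(rng: random.Random) -> dict:
    """Pick the city first so the street is guaranteed to match it."""
    city = rng.choice(sorted(STREETS_BY_CITY))
    return {"city": city, "street": rng.choice(STREETS_BY_CITY[city])}

print(pseudonymize_address(random.Random(7)))
```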
enahs-sf · 12 months ago
I love that it's open-source. Great project and very applicable across a lot of industries, especially those deeply affected by compliance.
pitah1 · 12 months ago
Thanks for sharing. Happy to see another solution that doesn't just slap AI/ML on the problem.

I am also among the many people who have created a similar solution [0] :). The approach I took, though, is metadata-driven (given that most anonymisation solutions cannot guarantee sensitive data won't leak, and also open up network access from prod to test environments, security teams did not accept them while I was working at a bank). It also offers the option to validate based on the generated data (i.e. check whether your service or job has consumed the data correctly) and the ability to clean up the generated or consumed data.

Being metadata-driven opened up the possibility of linking to existing metadata services like data catalogs (OpenMetadata, Amundsen), data quality (Great Expectations, Soda), specification files (OpenAPI/Swagger), etc., which are often underutilized.

The other thing I found while building and getting feedback from customers was having referential integrity across data sources. For example, account-create events come through Kafka and are consumed and stored in Postgres, while at the end of the day a CSV file of the same accounts is also consumed by a batch job.

I'm wondering if you have come across similar thoughts or feedback from your users?

[0]: https://github.com/data-catering/data-caterer
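To make that cross-source scenario concrete, a hedged sketch (field names and sinks are hypothetical): generate the accounts once from a seed and derive both the event stream and the CSV from the same records, so every sink agrees on the ids.

```python
import csv
import json
import random

def generate_accounts(seed: int, n: int) -> list[dict]:
    rng = random.Random(seed)  # same seed -> same accounts everywhere
    return [{"account_id": f"acc-{rng.randrange(10**8):08d}",
             "balance": rng.randrange(10_000)} for _ in range(n)]

accounts = generate_accounts(seed=42, n=3)

# Sink 1: account-create events (these would be produced to Kafka).
events = [json.dumps({"type": "account_created", **a}) for a in accounts]

# Sink 2: the end-of-day CSV consumed by the batch job.
with open("accounts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["account_id", "balance"])
    writer.writeheader()
    writer.writerows(accounts)
```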
kjuulh · 12 months ago
I just published our approach to pseudo-anonymization and, sort of, anonymization.

We built a tool that can traverse data, extract the PII, and put a token back into the data. Before one of our allowed systems reads the data, we swap in the actual data, or an anonymized version if we no longer have permission to use it. So we sort of get the best of both worlds: we can use our customers' actual data where we require it, but we can also safely use data for analytics while retaining a lot of the statistical variance of our data.

Crazy complex project to work on given our limited resources, but very fulfilling in the end.

It should be mentioned that I don't cover the difference between anonymization and pseudo-anonymization in the article, mostly because I didn't know it was really a thing; I just implemented a solution given our requirements.

https://tech.lunar.app/blog/data-anonymization-at-scale
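A rough sketch of that token-swap flow (not Lunar's actual implementation; the vault here is an in-memory dict standing in for a secure store):

```python
import uuid

VAULT: dict[str, str] = {}  # token -> real value

def tokenize(value: str) -> str:
    """Replace a PII value with an opaque token at write time."""
    token = f"tok_{uuid.uuid4().hex}"
    VAULT[token] = value
    return token

def read(token: str, allowed: bool) -> str:
    """Allowed systems get the real value back; others get a stand-in."""
    return VAULT[token] if allowed else f"user-{token[-6:]}"

record = {"name": tokenize("Jane Doe"), "plan": "premium"}
print(read(record["name"], allowed=True))   # -> Jane Doe
print(read(record["name"], allowed=False))  # -> anonymized stand-in
```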
aj__chan · 12 months ago
Amazing open source project! I can see pretty broad application: basically every application developer's stack as they're building out their tools, and also working with real-world production data in developer environments without breaking compliance. Great work, Evis & Nick!
chairmanwow1 · 12 months ago
Interesting, but why does it matter that I keep the same statistical distributions of data in development as in production? What are the use cases for that kind of feature?
ngcazz · 12 months ago
Hey, this looks quite cool! Just spotted that this link on your site's front page is 404ing: https://www.neosync.dev/solutions/keep-environments-in-sync (I was quite keen to read this one specifically)