Hi HN! My name is Gleb. I'm here with my co-founder Alex to tell you
about our company Datafold (<a href="https://datafold.com" rel="nofollow">https://datafold.com</a>). Datafold lets you diff
large datasets for fast and powerful regression testing. We support
databases such as PostgreSQL, Snowflake, BigQuery, and Redshift.<p>One of the biggest pain points in developing ETL pipelines –
chains of jobs that move, clean, merge and aggregate analytical data –
has been regression testing: verifying how a change in source code (mostly
SQL) affects the produced data.<p>Early in my career, as an on-call data engineer at Lyft, I
accidentally introduced a breaking code change while attempting to
ship a hotfix at 4AM to a SQL job that computed tables for core
business analytics. A seemingly small change in filtering logic ended
up corrupting data for all downstream pipelines and breaking
dashboards for the entire company. Apart from being a silly mistake,
this highlighted the lack of proper tooling for testing changes. If
there had been a way to quickly compare the data computed by
production code vs. the hotfix branch, I would have immediately
spotted the alarming divergence and avoided merging the breaking
change.<p>Without a diffing tool, the typical options for regression testing
are: (1) Data “unit tests” (e.g. check primary key uniqueness, ensure
values fall within an expected interval, etc.) – these are helpful, but a costly
investment. Frameworks such as dbt make it easier, but it’s
often still prohibitively hard to verify all assumptions in a large
table. (2) Write custom SQL queries to compare data produced by the
prod and dev versions of the source code (e.g. compare counts, match
primary keys). This can easily take up 100+ lines of SQL and hours of
unsatisfying work, which no one really wants to do. (3) "Fuck It, Ship
It" is always an option, but too risky nowadays, as analytical data not
only powers dashboards but also production ML models.<p>As this problem is common in data engineering, some large
organizations have built and open-sourced their solutions – for
example, BigDiffy by Spotify. However, most of these tools are
CLI-based and produce results in a plain-text format which is hard to
comprehend when you are dealing with complex data.<p>To fit our users’ existing workflows, we’ve built a web interface
with interactive charts showing both diff summary statistics (e.g. %
of different values by column) and value-level side-by-side comparison
(git diff style). But since the mission of the tool is to save
engineers as much time as possible, we also opened an API for
automation through Airflow or other orchestrators, and built a GitHub
workflow that runs diff on every pull request with changes to ETL
code. Since billion-row-scale datasets are not uncommon nowadays,
there is an optional sampling feature that keeps compute costs
low and returns results within a few minutes no matter how large the
dataset is.<p>We've found Datafold to be a good fit for the following workflows: (1)
Developing data transformations – before an ETL job is shipped to
production, it undergoes multiple iterations. Often it’s important to
see how data changes between iterations; this is particularly useful
when you have 1M+ rows and 100+ columns, where “SELECT *” becomes
useless. (2) Code review & testing: large organizations have hundreds
of people committing to ETL codebases. Understanding the impact of
even a modest SQL diff is daunting. Datafold can produce a data diff
for every commit in minutes so changes are well understood. (3) Data
transfer validation: moving large volumes of data between databases is
error-prone, especially if done via change data capture (CDC): a
single lost event can affect the resulting dataset in a way that is
tricky to debug. We allow comparing datasets across different
databases, e.g. PostgreSQL & Snowflake.<p>We've set up a sandbox at <a href="https://app.datafold.com/hackernews" rel="nofollow">https://app.datafold.com/hackernews</a> so you can see
how diffing works. Shoot us an email (hn@datafold.com) to set up a trial and use it with
your own data.<p>We are passionate about improving tooling for data engineers and would
love to hear about your experience with developing data pipelines and
ensuring data quality. Also, if you think that dataset diffing can be
helpful in other domains, we are very curious to learn from you!
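<p>P.S. If you're curious what the hand-rolled comparison from option (2) above looks like in miniature, here's a toy sketch using SQLite (table and column names are made up for illustration). Note that even this tiny version needs a separate query per check, plus one comparison clause per column – which is exactly what balloons into 100+ lines of SQL on a real wide table:

```python
# Toy version of hand-rolled data diffing: compare a "prod" and a "dev"
# build of the same table by primary key. All names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rides_prod (ride_id INTEGER PRIMARY KEY, fare REAL);
    CREATE TABLE rides_dev  (ride_id INTEGER PRIMARY KEY, fare REAL);
    INSERT INTO rides_prod VALUES (1, 12.5), (2, 8.0), (3, 20.0);
    -- dev build lost ride 3 and changed ride 2's fare
    INSERT INTO rides_dev  VALUES (1, 12.5), (2, 9.0);
""")

# 1. Compare row counts.
(prod_count,), = conn.execute("SELECT COUNT(*) FROM rides_prod")
(dev_count,),  = conn.execute("SELECT COUNT(*) FROM rides_dev")

# 2. Primary keys present in prod but missing from dev.
missing = [r[0] for r in conn.execute("""
    SELECT p.ride_id FROM rides_prod p
    LEFT JOIN rides_dev d USING (ride_id)
    WHERE d.ride_id IS NULL
""")]

# 3. Matching keys whose values diverged
#    (IS NOT is SQLite's null-safe comparison; one clause per column).
changed = [r[0] for r in conn.execute("""
    SELECT p.ride_id FROM rides_prod p
    JOIN rides_dev d USING (ride_id)
    WHERE p.fare IS NOT d.fare
""")]

print(prod_count, dev_count)  # 3 2
print(missing)                # [3]
print(changed)                # [2]
```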