TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Launch HN: Datafold (YC S20) – Diff Tool for SQL Databases

189 pointsby hichkakeralmost 5 years ago
Hi HN! My name is Gleb. I&#x27;m here with my co-founder Alex to tell you about our company Datafold (<a href="https:&#x2F;&#x2F;datafold.com" rel="nofollow">https:&#x2F;&#x2F;datafold.com</a>). Datafold lets you diff large datasets for fast and powerful regression testing. We support databases such as PostgreSQL, Snowflake, BigQuery, and Redshift.<p>One of the biggest pain points in developing ETL pipelines – chains of jobs that move, clean, merge and aggregate analytical data – has been regression testing: verifying how a change in source code (mostly, SQL) affects the produced data.<p>Early in my career, as an on-call data engineer at Lyft, I accidentally introduced a breaking code change while attempting to ship a hotfix at 4AM to a SQL job that computed tables for core business analytics. A seemingly small change in filtering logic ended up corrupting data for all downstream pipelines and breaking dashboards for the entire company. Apart from being a silly mistake, this highlighted the lack of proper tooling for testing changes. If there had been a way to quickly compare the data computed by production code vs. the hotfix branch, I would have immediately spotted the alarming divergence and avoided merging the breaking change.<p>Without a diffing tool, the typical options for regression testing are: (1) Data “unit tests” (e.g. check primary key uniqueness, ensure values are within interval, etc.) – these are helpful, but costly investment. Frameworks such as dbt make it easier, but it’s often still prohibitively hard to verify all assumptions in a large table. (2) Write custom SQL queries to compare data produced by the prod and dev versions of the source code (e.g. compare counts, match primary keys). This can easily take up 100+ lines of SQL and hours of unsatisfying work, which no one really wants to do. (3) &quot;Fuck It, Ship It&quot; is always an option but too risky nowadays as analytical data not only powers dashboards but also production ML models.<p>As this problem is common in data engineering, some large organizations have built and open-sourced their solutions – for example, BigDiffy by Spotify. However, most of these tools are CLI-based and produce results in a plain-text format which is hard to comprehend when you are dealing with complex data.<p>To fit existing workflows of our users, we’ve built a web interface with interactive charts showing both diff summary statistics (e.g. % of different values by column) and value-level side-by-side comparison (git diff style). But since the mission of the tool is to save engineers as much time as possible, we also opened an API for automation through Airflow or other orchestrators, and built a Github workflow that runs diff on every pull request with changes to ETL code. Since billion-row-scale datasets are not uncommon nowadays, there is an optional sampling feature that helps keep compute costs low and get results within a few minutes no matter how large the dataset is.<p>We&#x27;ve found Datafold to be a good fit for the following workflows: (1) Developing data transformations – before an ETL job is shipped to production, it undergoes multiple iterations. Often it’s important to see how data changes between every iteration, and particularly useful if you have 1M+ rows and 100+ columns where “SELECT *” becomes useless. (2) Code review &amp; testing: large organizations have hundreds of people committing to ETL codebases. Understanding the impact of even a modest SQL diff is daunting. Datafold can produce a data diff for every commit in minutes so changes are well understood. (3) Data transfer validation: moving large volumes of data between databases is error-prone, especially if done via change data capture (CDC): a single lost event can affect the resulting dataset in a way that is tricky to debug. We allow comparing datasets across different databases, e.g. PostgreSQL &amp; Snowflake.<p>We&#x27;ve set up a sandbox at <a href="https:&#x2F;&#x2F;app.datafold.com&#x2F;hackernews" rel="nofollow">https:&#x2F;&#x2F;app.datafold.com&#x2F;hackernews</a> so you can see how diffing works. Shoot us an email (hn@datafold.com) to set up a trial and use it with your own data.<p>We are passionate about improving tooling for data engineers and would love to hear about your experience with developing data pipelines and ensuring data quality. Also, if you think that dataset diffing can be helpful in other domains, we are very curious to learn from you!

23 comments

GordonSalmost 5 years ago
I was curious about pricing, but I see it&#x27;s &quot;call me pricing&quot; with buttons to schedule a demo, so at least I can see this is squarely aimed at the enterprise. If I&#x27;m being honest, I don&#x27;t like seeing &quot;call me pricing&quot; on HN; there are no rules against it, but it just doesn&#x27;t feel right on HN.<p>Are you able to say anything about pricing here?
评论 #24072754 未加载
评论 #24076855 未加载
throwaway_pdp09almost 5 years ago
I see a lot of these things and I don&#x27;t understand them. I&#x27;ve done too much ETL so I&#x27;m not naive. Now either 1) people are making a mountain out of a molehill (not saying that&#x27;s happening here, but in other cases I think so) 2) there&#x27;s something my experience of ETL hasn&#x27;t taught me or 3) these tools are specialised for niches. This one talks about &#x27;large datasets&#x27; but I don&#x27;t know how large that is.<p>Some questions then<p>&gt; Often it’s important to see how data changes between every iteration, and particularly useful if you have 1M+ rows and 100+ columns where “SELECT *” becomes useless.<p>select is fine for diffing. You just do an either-way except , something like<p><pre><code> ( select f1, f2, f3 ... f100 from t1 except select f1, f2, f3 ... f100 from t2 ) union ( select f1, f2, f3 ... f100 from t2 except select f1, f2, f3 ... f100 from t1 ) </code></pre> used this and it&#x27;s fine on many rows (millions is fine but I do recommend and index and a DB with a halfway decent optimiser).<p>&gt; (2)<p>Interesting. OK.<p>&gt; (3) Data transfer validation: moving large volumes of data between databases is error-prone<p>Really? I never had a problem. What is &#x27;large&#x27;? what problems have you seen? There are easy solutions with checksums, error correction (comes free with networks) or round-tripping, is that a problem?<p>Edit, just done that with mssql tables, 8 cols, 38 bytes per row, ~776,000 rows (identical but for one row), diff as above takes 2 seconds without an index (with PK it takes 5 seconds. Sigh. Well done MS). The single row discrepancy shows up fine. Totally trivial to extend it to 100 columns (did that too in previous job).
评论 #24073489 未加载
评论 #24074822 未加载
ishchekleinalmost 5 years ago
Hey! Looks great! Is there an example of the Github integration - how does it looks like?<p>I&#x27;m one of the developers and maintainer of the DVC project and we recently released CML.dev- which integrates with Github and can be used to run some checks on data as well. But in our case it&#x27;s about analyzing files more or less. I&#x27;m curious how does that integration look like in your case.
评论 #24074003 未加载
hodgesrmalmost 5 years ago
It is always good to see new approaches to testing but I don&#x27;t see how this one is going to work. I&#x27;ve worked at multiple database companies. Diff&#x27;ing data is one of the weakest and most cumbersome ways to verify correctness.<p>Diffs are relatively slow, when they fail you get a blizzard of errors, and the oracles (i.e. the &quot;good&quot; output) have to be updated constantly as the product changes. Plus I don&#x27;t see how this helps with schema migration or performance issues, which are major problems in data management. And don&#x27;t get me started on handling things like dates, which change constantly, hence break diffs.<p>If you really care about correctness it&#x27;s better to use approaches like having focused test cases that check specific predicates on data. They can run blindingly fast and give you actionable data about regressions. They&#x27;re also a pain to code but are most productive in the long run.
评论 #24088378 未加载
stephane-kleinalmost 5 years ago
At the moment I use <a href="https:&#x2F;&#x2F;github.com&#x2F;djrobstep&#x2F;migra" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;djrobstep&#x2F;migra</a> to make PostgreSQL diff.
评论 #24072608 未加载
itpragmatikalmost 5 years ago
1) Will it allow me to diff AWS RDS Aurora&#x2F;MySQL serverA database schemaA against AWS RDS Aurora&#x2F;MySQL serverB database schemaB ? 2) Are there APIs to initiate and view&#x2F;parse these diffs that you generate or is it all through UI?
评论 #24074185 未加载
igorludialmost 5 years ago
To whome it may concern, we have written a paper on the subject (I&#x27;m not affiliated with datafold): <a href="https:&#x2F;&#x2F;www.researchgate.net&#x2F;publication&#x2F;323563341_A_Time-constrained_Algorithm_for_Integration_Testing_in_a_Data_Warehouse_Environment" rel="nofollow">https:&#x2F;&#x2F;www.researchgate.net&#x2F;publication&#x2F;323563341_A_Time-co...</a><p>The paper describes the original algorithm with examples.
MaxwellMalmost 5 years ago
Much needed! Analysts struggle with this all the time - trying to explain why an analysis is off and having to manually debug every column in a new database
kveykvaalmost 5 years ago
In case you&#x27;re unaware, your logo looks unfortunately a bit uncannily similar to that of <a href="https:&#x2F;&#x2F;www.sigmacomputing.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.sigmacomputing.com&#x2F;</a> given you are both in a similar broader product category! I actually thought your logo throughout the site was actually a reference to integration with sigma at first.
评论 #24088282 未加载
gregalmost 5 years ago
I got the chance to play with Datafold and I would have <i></i>loved<i></i> to have had it when I was working for Facebook on data pipelines.
nicoburnsalmost 5 years ago
I&#x27;ve recently discovered, and highly recommend Daff [0]. It&#x27;s an open source tool that can diff CSVs and SQLlite database tables. The tabular diff format is fantastic.<p>[0]: <a href="https:&#x2F;&#x2F;github.com&#x2F;paulfitz&#x2F;daff" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;paulfitz&#x2F;daff</a>
brainlessalmost 5 years ago
Hey Gleb, congrats on the launch. This is an interesting tool.<p>Myself being a founder of a product in the space of tools for data-driven companies, I wanted to ask -<p>Is you product aimed entirely at data engineers? The description seemed very technical and a problem that mostly very large companies would have. Did I understand correctly?
评论 #24072541 未加载
chapsalmost 5 years ago
Can this be used locally for datasets with non-transferable PII? Thinking about this for non-profit work.
评论 #24073763 未加载
gregwebsalmost 5 years ago
Great tool! I am only interested in running such a tool locally (on-prem). This avoids security&#x2F;privacy issues and data transfer time&#x2F;cost issues.<p>A good model for me would be 30 day free license to get it integrated into our worklows.
评论 #24088226 未加载
jjirsaalmost 5 years ago
For Cassandra: <a href="https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;cassandra-diff" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;apache&#x2F;cassandra-diff</a><p>(Designed for correctness testing at petabyte scale)
sbr464almost 5 years ago
Can you import&#x2F;work with .bcp files? (Microsoft SQL server bulk export files). For example, diffing 2 bcp files, even if you need to import and set them up as databases again.
评论 #24075819 未加载
lihanalmost 5 years ago
How does it work behind the scene? Is it simply sample a portion of the data then do the diff? What if I need 100% accuracy?
评论 #24088249 未加载
samblralmost 5 years ago
Congrats on the launch.<p>How does it compare to enterprise ETL tools like Informatica, Talend - is it not possible to do these within them ?
评论 #24073314 未加载
blunt69almost 5 years ago
Will there be integration with ORACLE Dbs?
评论 #24073516 未加载
chwolfealmost 5 years ago
Very cool, nice work! SQL Server &#x2F; Azure SQL support available or on the roadmap?
评论 #24072866 未加载
评论 #24075025 未加载
FlashBlazealmost 5 years ago
It is interesting to see Ant Design used other than my personal projects.
husseinyalmost 5 years ago
Very cool! Does it work with MS SQL in our own DC?
评论 #24088252 未加载
greenie_beansalmost 5 years ago
love it, fucked up and made a startup with a solution so you&#x27;ll never make that mistake again.