Hi HN! My name is Gleb. I'm here with my co-founder Alex to tell you
about our company Datafold (<a href="https://datafold.com" rel="nofollow">https://datafold.com</a>). Datafold lets you diff
large datasets for fast and powerful regression testing. We support
databases such as PostgreSQL, Snowflake, BigQuery, and Redshift.<p>One of the biggest pain points in developing ETL pipelines –
chains of jobs that move, clean, merge and aggregate analytical data –
has been regression testing: verifying how a change in source code (mostly
SQL) affects the produced data.<p>Early in my career, as an on-call data engineer at Lyft, I
accidentally introduced a breaking code change while attempting to
ship a hotfix at 4AM to a SQL job that computed tables for core
business analytics. A seemingly small change in filtering logic ended
up corrupting data for all downstream pipelines and breaking
dashboards for the entire company. Apart from being a silly mistake,
this highlighted the lack of proper tooling for testing changes. If
there had been a way to quickly compare the data computed by
production code vs. the hotfix branch, I would have immediately
spotted the alarming divergence and avoided merging the breaking
change.<p>Without a diffing tool, the typical options for regression testing
are: (1) Data “unit tests” (e.g. check primary key uniqueness, ensure
values fall within an expected interval, etc.) – these are helpful, but a costly
investment. Frameworks such as dbt make it easier, but it’s
often still prohibitively hard to verify all assumptions in a large
table. (2) Write custom SQL queries to compare data produced by the
prod and dev versions of the source code (e.g. compare counts, match
primary keys). This can easily take up 100+ lines of SQL and hours of
unsatisfying work, which no one really wants to do. (3) "Fuck It, Ship
It" is always an option, but too risky nowadays, as analytical data not
only powers dashboards but also production ML models.<p>As this problem is common in data engineering, some large
organizations have built and open-sourced their solutions – for
example, BigDiffy by Spotify. However, most of these tools are
CLI-based and produce results in a plain-text format which is hard to
comprehend when you are dealing with complex data.<p>To fit our users’ existing workflows, we’ve built a web interface
with interactive charts showing both diff summary statistics (e.g. %
of different values by column) and value-level side-by-side comparison
(git diff style). But since the mission of the tool is to save
engineers as much time as possible, we also opened an API for
automation through Airflow or other orchestrators, and built a GitHub
workflow that runs diff on every pull request with changes to ETL
code. Since billion-row-scale datasets are not uncommon nowadays,
there is an optional sampling feature that keeps compute costs
low and returns results within a few minutes no matter how large the
dataset is.<p>We've found Datafold to be a good fit for the following workflows: (1)
Developing data transformations – before an ETL job is shipped to
production, it undergoes multiple iterations. Often it’s important to
see how data changes between iterations; this is particularly useful
when you have 1M+ rows and 100+ columns, where “SELECT *” becomes
useless. (2) Code review & testing: large organizations have hundreds
of people committing to ETL codebases. Understanding the impact of
even a modest SQL diff is daunting. Datafold can produce a data diff
for every commit in minutes so changes are well understood. (3) Data
transfer validation: moving large volumes of data between databases is
error-prone, especially if done via change data capture (CDC): a
single lost event can affect the resulting dataset in a way that is
tricky to debug. We allow comparing datasets across different
databases, e.g. PostgreSQL & Snowflake.<p>We've set up a sandbox at <a href="https://app.datafold.com/hackernews" rel="nofollow">https://app.datafold.com/hackernews</a> so you can see
how diffing works. Shoot us an email (hn@datafold.com) to set up a trial and use it with
your own data.<p>We are passionate about improving tooling for data engineers and would
love to hear about your experience with developing data pipelines and
ensuring data quality. Also, if you think that dataset diffing can be
helpful in other domains, we are very curious to learn from you!
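<p>P.S. If you're curious what the hand-rolled comparison from option (2) above looks like in miniature, here's a toy sketch using SQLite (table and column names are made up for illustration). Note that even this tiny version needs a separate query per check, plus one comparison clause per column – which is exactly what balloons into 100+ lines of SQL on a real wide table:

```python
# Toy version of hand-rolled data diffing: compare a "prod" and a "dev"
# build of the same table by primary key. All names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rides_prod (ride_id INTEGER PRIMARY KEY, fare REAL);
    CREATE TABLE rides_dev  (ride_id INTEGER PRIMARY KEY, fare REAL);
    INSERT INTO rides_prod VALUES (1, 12.5), (2, 8.0), (3, 20.0);
    -- dev build lost ride 3 and changed ride 2's fare
    INSERT INTO rides_dev  VALUES (1, 12.5), (2, 9.0);
""")

# 1. Compare row counts.
(prod_count,), = conn.execute("SELECT COUNT(*) FROM rides_prod")
(dev_count,),  = conn.execute("SELECT COUNT(*) FROM rides_dev")

# 2. Primary keys present in prod but missing from dev.
missing = [r[0] for r in conn.execute("""
    SELECT p.ride_id FROM rides_prod p
    LEFT JOIN rides_dev d USING (ride_id)
    WHERE d.ride_id IS NULL
""")]

# 3. Matching keys whose values diverged
#    (IS NOT is SQLite's null-safe comparison; one clause per column).
changed = [r[0] for r in conn.execute("""
    SELECT p.ride_id FROM rides_prod p
    JOIN rides_dev d USING (ride_id)
    WHERE p.fare IS NOT d.fare
""")]

print(prod_count, dev_count)  # 3 2
print(missing)                # [3]
print(changed)                # [2]
```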