Tech Echo

7 comments

I've been in the end stage of this (worked on data validation for a good chunk of my career) and these are my thoughts on the article:

Determining blocking vs. non-blocking is a big issue. Deciding which checks should be stoppers and which shouldn't is often a matter of extensive debate. In my experience, only a few data checks are absolute show stoppers under any circumstance, and a lot of things need to spawn tickets that should be routed to the correct team and followed up on. Some type of tracking system is necessary for this.

Defining the logic of checks themselves in YAML is a trap. We went down this DSL route first and it basically just completely falls apart once you want to add moderately complex logic to your check. Airbnb will almost certainly discover this eventually. YAML does work well for the specification of how the check should behave, though (e.g. metadata of the data check). The solution we were eventually able to scale up with was coupling specifications in a human-readable but parseable file with code in a single unit known as the check. These could then be grouped according to various pipeline use cases.

A model that plugs into an Airflow DAG, as Airbnb has designed, seems like a good approach. Often when it was time to incorporate checks into the pipeline we had heterogeneous strategies to invoke our check engines. Having a standardized approach helps drive adoption across the organization. Oftentimes I've found that people are reluctant to run non-critical checks if it's a significant time and effort cost, and will only run critical ones to try and push data quality accountability either upstream or downstream. If it's really easy to turn on and incorporate, that's one less excuse that can be used to not run the checks.
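To make the "spec plus code in a single unit" idea concrete, here is a minimal Python sketch. It is not the commenter's system or Airbnb's Wall framework. All names (`CheckSpec`, `Check`, `run_checks`, `owner_team`, the two example checks) are hypothetical, and the spec is a plain object here where a real setup would presumably parse it from a YAML file. It assumes a blocking failure stops the run and a non-blocking failure only produces a ticket routed to the owning team.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict


@dataclass(frozen=True)
class CheckSpec:
    """Metadata only (hypothetical fields); this is the part that fits well in YAML."""
    name: str
    blocking: bool
    owner_team: str


@dataclass(frozen=True)
class Check:
    """One unit: the declarative spec plus the code implementing the logic."""
    spec: CheckSpec
    logic: Callable[[list[Row]], bool]  # returns True when the data passes


class BlockingCheckFailed(Exception):
    """Raised when a check marked as blocking fails."""


def run_checks(checks: list[Check], rows: list[Row]) -> list[dict]:
    """Run checks in order; stop on a blocking failure, collect tickets otherwise."""
    tickets = []
    for check in checks:
        if check.logic(rows):
            continue
        if check.spec.blocking:
            raise BlockingCheckFailed(check.spec.name)
        # Non-blocking failure: record a ticket routed to the owning team.
        tickets.append({"check": check.spec.name, "route_to": check.spec.owner_team})
    return tickets


# Hypothetical example checks and data.
no_null_ids = Check(
    CheckSpec(name="no_null_ids", blocking=True, owner_team="ingest"),
    lambda rows: all(r.get("id") is not None for r in rows),
)
price_positive = Check(
    CheckSpec(name="price_positive", blocking=False, owner_team="pricing"),
    lambda rows: all(r.get("price", 0) > 0 for r in rows),
)

rows = [{"id": 1, "price": 120.0}, {"id": 2, "price": -5.0}]
tickets = run_checks([no_null_ids, price_positive], rows)
print(tickets)
```

Because the logic is ordinary code, a moderately complex check does not have to be squeezed into a DSL, while the blocking flag and routing stay declarative.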


quadrifoliate, about 3 years ago

I'm a little bit annoyed at reading about details that seem closely connected to internal code (e.g. CheckConfigModel classes) without being able to see the source.

I am not sure what others find so compelling about this blog post. Granted it's from Airbnb, which probably has one of the more interesting data sets, but honestly it looks to me like an internal blog post that's been reposted to Medium without considering the viewpoint of an external user. I understand if they don't want to open source the framework; but then most of the blog post should be about design principles, maybe a bit about the process itself, not implementation details that seem directed towards an internal audience.

jm1271, about 3 years ago

Thanks for this post! Naive question: why not "just use Great Expectations"? At first blush GE seems like it has a lot of what you need out of the box: checks definable in YAML, extensibility, and connectors to many major data sources.

Was there something you all found lacking there which made "roll your own" the right approach here?


charlysl, about 3 years ago

From related https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7

> The new role requires Data Engineers to be strong across several domains, including data modeling, pipeline development, and software engineering.

> comprehensive guidelines for data modeling, operations, and technical standards for pipeline implementation

> Tables must be normalized (within reason) and rely on as few dependencies as possible. Minerva does the heavy lifting to join across data models.

> When we began the Data Quality initiative, most critical data at Airbnb was composed via SQL and executed via Hive. This approach was unpopular among engineers, as SQL lacked the benefits of functional programming languages (e.g. code reuse, modularity, type safety, etc)

> made the shift to Spark, and aligned on the Scala API as our primary interface. Meanwhile, we ramped investment into a common Spark wrapper to simplify reads/write patterns and integration testing.

> needed to improve was our data pipeline testing. This slowed iteration speed and made it difficult for outsiders to safely modify code. We required that pipelines be built with thorough integration tests

> tooling for executing data quality checks and anomaly detection, and required their use in new pipelines. Anomaly detection in particular has been highly successful in preventing quality issues in our new pipelines.

> important datasets are required to have an SLA for landing times, and pipelines are required to be configured with Pager Duty

> a Spec document that provides layman’s descriptions for metrics and dimensions, table schemas, pipeline diagrams, and describes non-obvious business logic and other assumptions

> a data engineer then builds the datasets and pipelines based on the agreed upon specification

geoffjentry, about 3 years ago

Is this available for others to use or internal only? I think the answer is the latter, as a Google search didn't turn anything up and I didn't see anything in the article. But if I'm wrong I'd love to kick the tires a bit.

testbjjl, about 3 years ago

Maybe Jim Buckmaster and Craig Newmark are taking notes.


d_burfoot, about 3 years ago

> Hive SQL, Spark SQL, Scala Spark, PySpark and Presto are widely used as different execution engines

This makes me think they're doing something very very wrong. Airbnb does not have data on the scale that would require these tools. They have 5.6 million listings, 150 million users, and 1 billion total person-stays. These numbers can easily be processed with Postgres or SQLite on single machines. Spark and Hive are for companies like Google and Facebook.

https://www.thezebra.com/resources/home/airbnb-statistics/#infographic

How Airbnb Built “Wall” to prevent data bugs
