From the related post <a href="https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7" rel="nofollow">https://medium.com/airbnb-engineering/data-quality-at-airbnb...</a>
<p>> The new role requires Data Engineers to be strong across several domains, including data modeling, pipeline development, and software engineering.
<p>> comprehensive guidelines for data modeling, operations, and technical standards for pipeline implementation
<p>> Tables must be normalized (within reason) and rely on as few dependencies as possible. Minerva does the heavy lifting to join across data models.
<p>> When we began the Data Quality initiative, most critical data at Airbnb was composed via SQL and executed via Hive. This approach was unpopular among engineers, as SQL lacked the benefits of functional programming languages (e.g. code reuse, modularity, type safety, etc)
<p>> made the shift to Spark, and aligned on the Scala API as our primary interface. Meanwhile, we ramped investment into a common Spark wrapper to simplify reads/write patterns and integration testing.
<p>> needed to improve was our data pipeline testing. This slowed iteration speed and made it difficult for outsiders to safely modify code. We required that pipelines be built with thorough integration tests
<p>> tooling for executing data quality checks and anomaly detection, and required their use in new pipelines. Anomaly detection in particular has been highly successful in preventing quality issues in our new pipelines.
<p>> important datasets are required to have an SLA for landing times, and pipelines are required to be configured with Pager Duty
<p>> a Spec document that provides layman’s descriptions for metrics and dimensions, table schemas, pipeline diagrams, and describes non-obvious business logic and other assumptions
<p>> a data engineer then builds the datasets and pipelines based on the agreed upon specification
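<p>The anomaly-detection point is the most concrete practice in the list. The post doesn't show any implementation, so the following is purely a sketch of the general idea in plain Python (Airbnb's actual tooling runs on Spark/Scala): compare today's value of a pipeline metric, such as a table's daily row count, against recent history and flag large deviations. The function name and the z-score threshold are illustrative assumptions.

```python
# Hypothetical sketch of a data-quality anomaly check: flag a metric
# value that deviates strongly from its recent history. Not Airbnb's
# actual tooling; the z-score approach and threshold are assumptions.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag `today` if it lies more than `threshold` standard
    deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is suspicious.
        return today != mu
    return abs(today - mu) / sigma > threshold

# Example: a week of daily row counts, then a sudden drop.
counts = [10_120, 10_340, 9_980, 10_200, 10_410, 10_050, 10_290]
print(is_anomalous(counts, 10_180))  # False: within normal variation
print(is_anomalous(counts, 1_200))   # True: pipeline likely dropped data
```

In production such a check would run as a post-load validation step, blocking downstream consumers or paging the on-call engineer (the post mentions PagerDuty integration) instead of printing.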