Been working with an org that is really struggling with the reliability and maintainability of their ETLs and data pipelines. Could you share some tools and best practices in 2017?
The backstop "best practice" is to put somebody in charge of the issue and give them both the responsibility and authority to fix it.<p>Next you need to face up to the causes of the problems. There may be five or six root causes, and if you plan to fix just 4 of them, you will pay 80% of the costs, but get 2% or 0% of the benefits in terms of cost savings because the other root causes will still cause chaos, and now people will start to blame the tools and procedures that were tried (those costs will be very visible.)<p>Getting people on board with a realistic plan can be a little bit like getting an alcoholic to recognize the damage that drinking has done to their life, but the alternative is wishful thinking.<p>If you want to get into more specifics, click on my HN id and send me an email.<p>See<p><a href="https://www.amazon.com/Art-Getting-Your-Own-Sweet/dp/0070145156" rel="nofollow">https://www.amazon.com/Art-Getting-Your-Own-Sweet/dp/0070145...</a>
We are using Airflow to manage ETL jobs. Nearly all of these are SQL steps, dynamically generated via an Airflow DAG, that transform transaction and event data in our SQL warehouse into 'master' tables everyone has access to. All SQL and DAG code is committed to GitHub, and we have a process to update Airflow and merge any changes after they are peer reviewed. Every change goes through a PR, so we have visibility and accountability.<p>One thing we want to improve is our testing. I'm curious to hear how people manage test workflows and replicate prod before promoting new pipelines, i.e. I want the branch to run a full test suite against a prod replica before it automatically replaces the current prod pipeline.
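<p>For what it's worth, here is a rough sketch of the kind of check I have in mind, not something we actually run. It uses an in-memory SQLite database as a stand-in for the prod replica, and the table names, the transform SQL and the function names are all made up for illustration. The idea is to run one generated SQL step against the replica and then assert a few invariants on the resulting 'master' table before the branch is allowed to replace prod.

```python
import sqlite3

# Hypothetical transform step. In a real setup this SQL would be one of the
# generated DAG steps and the connection would point at a prod replica.
TRANSFORM_SQL = """
CREATE TABLE master_daily_revenue AS
SELECT day, SUM(amount) AS revenue, COUNT(*) AS n_transactions
FROM transactions
GROUP BY day
"""


def build_fixture(conn):
    """Create a tiny 'transactions' table standing in for replica data."""
    conn.execute(
        "CREATE TABLE transactions (id INTEGER PRIMARY KEY, day TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO transactions (day, amount) VALUES (?, ?)",
        [("2017-01-01", 10.0), ("2017-01-01", 5.0), ("2017-01-02", 7.5)],
    )


def check_master_table(conn):
    """Return a list of failed checks; an empty list means the step passed."""
    failures = []
    rows = conn.execute(
        "SELECT day, revenue, n_transactions FROM master_daily_revenue ORDER BY day"
    ).fetchall()
    if not rows:
        failures.append("master table is empty")
    days = [row[0] for row in rows]
    if len(days) != len(set(days)):
        failures.append("duplicate day in master table")
    # Reconciliation check: totals must match between source and master.
    source_total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]
    master_total = sum(row[1] for row in rows)
    if abs(source_total - master_total) > 1e-9:
        failures.append("revenue does not reconcile with source")
    return failures


def run_checks_on_replica():
    """Build the stand-in replica, run the transform, return failed checks."""
    conn = sqlite3.connect(":memory:")
    try:
        build_fixture(conn)
        conn.execute(TRANSFORM_SQL)
        return check_master_table(conn)
    finally:
        conn.close()
```

<p>A CI job on the PR branch could call `run_checks_on_replica()` and block the merge if the returned list is non-empty. The reconciliation check is the one I'd expect to catch the most real problems, since it compares the output against the source rather than against hard-coded expected values.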
Can you give more background on what your data pipelines are like? Are they mostly batch processes?<p>If so, I'd strongly recommend using a workflow tool like Luigi[0] or Airflow[1]. In a phrase, I'd say they're like "Make for data".<p>[0]: <a href="https://github.com/spotify/luigi" rel="nofollow">https://github.com/spotify/luigi</a>
[1]: <a href="https://github.com/apache/incubator-airflow" rel="nofollow">https://github.com/apache/incubator-airflow</a>
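<p>To make the "Make for data" idea concrete, here is a minimal sketch in plain Python. It is not the Luigi or Airflow API. The `run_pipeline` function and the task names are hypothetical, and it assumes Python 3.9+ for the standard-library `graphlib` module. Each task declares its upstream dependencies, tasks run in dependency order, and anything already marked done is skipped, much as Make skips targets that are already built.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps, done):
    """Run tasks in dependency order, skipping those already completed.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of upstream task names
    done:  set of task names already completed (updated in place)
    Returns the list of task names that were actually executed.
    """
    executed = []
    # static_order() yields each task only after all of its upstream tasks.
    for name in TopologicalSorter(deps).static_order():
        if name in done:
            continue  # output already exists, nothing to rebuild
        tasks[name]()
        done.add(name)
        executed.append(name)
    return executed
```

<p>With `deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}` and `done = {"extract"}`, only `transform` and `load` would run. The real tools add the parts that matter in production, such as scheduling, retries and a UI, but the dependency graph is the core idea.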