Hi everyone,
I'd like to share the project we have been working on for a couple of years: Arc, an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines:

- predictable in that transformations are defined as data (declarative configuration), not code

- repeatable in that executing a job multiple times produces the same result

- manageable in that execution considerations and logging have been baked in from the start

- MIT-licensed, open source and cloud agnostic

We have seen that it is hard to scale data engineering teams in a code-first environment. Arc solves a lot of the problems we have seen data engineering/science teams struggle with. It:

- makes data engineering accessible to people beyond dedicated data engineers - you don't need to be proficient in Scala/Spark to introduce data engineering to your team

- has a Jupyter Notebook-based development environment for quickly building logic

- provides a clear path to production for machine learning (via MLTransform, TensorflowServingTransform or HTTPTransform for models as a service)

- has a plugin system allowing federated development of any features not in the base framework

Currently it uses the Apache Spark execution engine, but because jobs are declarative the same definitions could be executed against future engines (there's a sketch of a job definition at the end of this post).

Please let us know if you have any feedback/suggestions.
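To make the declarative part concrete, here is a rough sketch of what a job definition might look like. This is an illustration rather than a snippet from the docs: Arc jobs are defined in HOCON (a superset of JSON, which is why // comments work below), and the stage names, field values and URIs here are assumptions chosen to show the shape of a pipeline - extract a source into a named view, transform it with SQL, load the result:

    {
      "stages": [
        {
          // read a CSV file into a view that later stages can reference
          "type": "DelimitedExtract",
          "name": "extract raw customers",
          "environments": ["test", "production"],
          "inputURI": "s3a://datalake/raw/customers.csv",
          "outputView": "customers_raw",
          "header": true
        },
        {
          // apply a versioned SQL statement to produce a cleaned view
          "type": "SQLTransform",
          "name": "clean customers",
          "environments": ["test", "production"],
          "inputURI": "s3a://datalake/sql/clean_customers.sql",
          "outputView": "customers_clean"
        },
        {
          // write the cleaned view out as Parquet
          "type": "ParquetLoad",
          "name": "load customers",
          "environments": ["test", "production"],
          "inputView": "customers_clean",
          "outputURI": "s3a://datalake/curated/customers.parquet"
        }
      ]
    }

Because the whole job is data, the same definition can be diffed, linted, generated by tooling and, in principle, replayed against a different execution engine.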