Looking to get feedback on a code-first platform for data: instead of custom frameworks, GUIs, and notebooks on a cron, bauplan runs SQL / Python functions from your IDE, in the cloud, backed by your object storage. Everything is versioned and composable: time-travel, git-like branches, scriptable meta-logic.

Perhaps surprisingly, we decided to co-design the abstractions and the runtime, which allowed novel optimizations at the intersection of FaaS and data - e.g. rebuilding functions can be 15x faster than the corresponding AWS stack (https://arxiv.org/pdf/2410.17465). All capabilities are available to humans (CLI) and machines (SDK) through simple APIs.

Would love to hear the community's thoughts on moving data engineering workflows closer to software abstractions: tables, functions, branches, CI/CD, etc.
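To give a concrete sense of the "machines (SDK)" side, here is a minimal sketch of the branch / run / merge pattern from Python. The method names (`create_branch`, `run`, `merge_branch`) and arguments are my assumptions for illustration based on the post's description; check the bauplan docs for the actual client API.

```python
# Hypothetical sketch of the branch -> run -> merge workflow via the SDK.
# Method names and signatures are illustrative assumptions, not the
# confirmed API surface.
import bauplan

client = bauplan.Client()

# Work on an isolated, git-like branch of the data catalog
client.create_branch("alice.quarterly_fix", from_ref="main")

# Run the pipeline in the cloud against that branch
client.run(project_dir="./my_pipeline", ref="alice.quarterly_fix")

# If the results look right, merge the branch back into main
client.merge_branch(source_ref="alice.quarterly_fix", into_branch="main")
```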
Looks interesting! Bauplan seems like a mix of an orchestration engine and a data warehouse. It's similar to Motherduck in that it runs DuckDB on managed EC2, with more data-engineer-focused branching and Python support similar to SQLMesh.

It's interesting that most vendors run compute in their own managed account instead of offering BYOC, though. I understand it's hard for vendors to manage compute in the customer's cloud, but I was under the impression that's a no-go for most enterprise companies. Maybe I'm wrong?
The Git-like approach to data versioning seems *really* promising to me, but I'm wondering what merge operations are expected to look like in practice. In a coding environment, I'd review a PR basically line-by-line to check for code quality, engineering soundness, etc. But in the data case it's not clear to me that a line-by-line review would be possible, or even useful; I'm also curious about what tooling (if any) is provided to support it.

For example: I saw the YouTube video demo someone linked here that showed a quarterly report pipeline. Say I'm one of two analysts tasked with producing that report, and my coworker would like to land a bunch of changes. Suppose that in their data branch, the topline report numbers differ from `main` by X%. Clearly it's due to *some* change in the pipeline, but it seems like I would still have to fire up a notebook and copy-paste chunks of the pipeline to see step-by-step where things diverge. Is there another recommended workflow (or, even better, provided tooling) for determining which deltas in the pipeline contributed to the X% difference?
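One workflow that seems natural here, given the branch model the post describes, is to run the same aggregate query against both refs and diff the results. A sketch under stated assumptions: the `client.query(sql, ref=...)` call and the branch name `coworker.q3_changes` are hypothetical, used only to illustrate the pattern.

```python
# Hypothetical sketch: compare topline numbers between a branch and main
# by running the same query on both refs. query(sql, ref=...) is an
# assumption based on the post's branch/SDK description.
import bauplan

client = bauplan.Client()

SQL = """
SELECT region, SUM(revenue) AS revenue
FROM quarterly_report
GROUP BY region
"""

main_df = client.query(SQL, ref="main").to_pandas()
branch_df = client.query(SQL, ref="coworker.q3_changes").to_pandas()

# Join on the grouping key and surface per-region deltas
diff = main_df.merge(branch_df, on="region", suffixes=("_main", "_branch"))
diff["delta_pct"] = 100 * (diff["revenue_branch"] - diff["revenue_main"]) / diff["revenue_main"]
print(diff.sort_values("delta_pct", ascending=False))
```

This doesn't pinpoint *which* pipeline step caused the drift, but pushing the comparison down to per-step output tables on both refs would narrow it without copy-pasting code into a notebook.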
Congrats on the more official launch! Super promising: the first product that pairs dbt-style data organization/orchestration with a compute layer worthy of replacing existing data warehouses and Python environments.
For someone like me (not an ML expert, but fluent in Python), Bauplan looks like an ideal fit. Looking forward to taking a deeper look and building something in production.