This is an article from Jan 2022, when we were a company of 10; we're now a company of ~80.<p>A few observations worth adding:<p>- We're still using Fivetran for the EL stages. Costs are much more significant than they were before, and for the high-volume sources we're looking into options like DataStream as cost savers, but it's not unmanageable.<p>- dbt is still working great, even though it's taken a lot of investment, having now built a 5-person data team (BI, DA, DE) around it.<p>- We still use Metabase but have some frustrations and are considering other options.<p>- We no longer use Stitch :tada:<p>There's a follow-up post on the improvements we made to our setup that may be interesting: <a href="https://incident.io/blog/updated-data-stack" rel="nofollow">https://incident.io/blog/updated-data-stack</a><p>The OP is still full of relevant, useful information, though (imo, of course).
What's the business justification for spending this much effort (money) on data warehousing as a startup?<p>I've not worked at any startups that did data warehousing; the one place I worked where we were /starting/ to get it set up had 300+ employees and $100M+/year revenue.
Meta does it another way. Instead of one giant data warehouse or various DW silos, build a data platform API stack with heterogeneous storage adapters, privacy policies, regional-locality policies, and retention policies underneath, supporting heterogeneous D*L operations. This sidesteps duplicating and denormalizing data and allows for maximum data discovery, reporting, and reuse. And while GraphQL can't be all things to all people, it's pretty damn good. If you need {MySQL,PostgreSQL,{{other_thing}}}-compatible or REST APIs, build them similarly.<p>ETL should be minimized (external data is the exception, and a bad sign: it means the data is owned or managed by a third party) and replaced with the equivalent of dynamic or materialized "views". Prefer to create hygienic "views" against the original data rather than mutating and destroying it with destructive transformations.<p>Finally, have a deeply integrated, robust, enterprise-wide, fine-grained ACL system and privacy policy to keep everyone (including system users) from accessing anything without a specific business-purpose need and an approval audit record stored via some sort of blockchain-like tech.
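To make the "platform API over heterogeneous adapters" idea concrete, here's a minimal sketch of what a policy-aware access layer might look like. All names and the single in-memory backend are hypothetical illustrations, not Meta's actual stack:

```python
from dataclasses import dataclass

# Hypothetical policy attached to each logical dataset.
@dataclass
class Policy:
    allowed_purposes: set  # business purposes permitted to read this dataset
    retention_days: int    # how long rows may be kept

# A storage adapter hides the concrete backend (RDBMS, blob store, ...).
# Here: a trivial in-memory stand-in with the same put/get surface.
class InMemoryAdapter:
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

# The platform API enforces the policy before delegating to the adapter,
# so every read must state a business purpose and leaves an audit record.
class DataPlatform:
    def __init__(self):
        self._adapters = {}
        self._policies = {}
        self.audit_log = []  # stand-in for an append-only audit record

    def register(self, dataset, adapter, policy):
        self._adapters[dataset] = adapter
        self._policies[dataset] = policy

    def write(self, dataset, key, value):
        self._adapters[dataset].put(key, value)

    def read(self, dataset, key, purpose):
        policy = self._policies[dataset]
        if purpose not in policy.allowed_purposes:
            raise PermissionError(f"{purpose!r} not permitted on {dataset!r}")
        self.audit_log.append((dataset, key, purpose))
        return self._adapters[dataset].get(key)
```

Callers never touch the backend directly; swapping MySQL for a blob store means registering a different adapter, while the purpose check and audit trail stay in one place.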
I’d be curious to know if you considered using something like Dagster for orchestrating these runs? Seems like a more natural choice over CircleCI for running what resembles a DAG. (And either way, thanks for sharing this.)
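For what it's worth, the "resembles a DAG" point is easy to see: dbt-style model runs are just dependency-ordered tasks, which is exactly what orchestrators like Dagster model natively. A minimal stdlib sketch (the model names are hypothetical; Dagster itself would express this with its asset APIs):

```python
from graphlib import TopologicalSorter

# Hypothetical dbt-style models mapped to their upstream dependencies.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "daily_revenue": {"fct_orders"},
}

def run_order(deps):
    """Return the models in an order where every dependency runs first."""
    return list(TopologicalSorter(deps).static_order())
```

A CI pipeline gives you a fixed sequence of steps; a DAG-aware scheduler derives the order (and any safe parallelism) from the dependencies themselves.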
This is likely here now because <a href="https://news.ycombinator.com/item?id=38797640">https://news.ycombinator.com/item?id=38797640</a> is on top of the front page and references it.