Hi all.

We want to set up a Data Warehouse. We have several external data sources, all relational: most of them MySQL databases, plus a couple of Postgres.

We'd like to set up and maintain a single MySQL / Postgres "Data Warehouse" database that houses all this data, so that our analytics team has a single place to access it from.

If you've done something similar, could you please share your experience and/or advice?

Any extra info on how you manage your data pipelines would be appreciated. Currently we're just looking at setting up some basic cron jobs that run bash scripts, which in turn execute mysqldumps, but we'll also set up replication in cases where live data is important.

Thanks! :-)
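For concreteness, here's a minimal sketch of the cron + mysqldump approach described above, written in Python rather than bash. It assumes MySQL credentials live in an option file (e.g. ~/.my.cnf), and the hostnames, database names, and paths are hypothetical. It dumps each source database and replays the dump into a per-source schema in the warehouse.

    #!/usr/bin/env python3
    """Nightly dump-and-load sketch (hypothetical hosts, DBs, and paths).

    Runs mysqldump against each source database and replays the dump into
    a single warehouse instance, one schema per source. Schedule from cron,
    e.g.:

        15 2 * * * /opt/etl/dump_and_load.py >> /var/log/etl/dump_and_load.log 2>&1
    """
    import subprocess

    # Hypothetical source databases -> target schemas in the warehouse.
    SOURCES = [
        {"host": "mysql-app1.internal", "db": "app1", "schema": "app1"},
        {"host": "mysql-app2.internal", "db": "app2", "schema": "app2"},
    ]
    WAREHOUSE_HOST = "warehouse.internal"

    def dump_and_load(src):
        dump_file = f"/tmp/{src['db']}.sql"
        # --single-transaction gives a consistent snapshot of InnoDB tables
        # without locking the source for the duration of the dump.
        with open(dump_file, "w") as out:
            subprocess.run(
                ["mysqldump", "--single-transaction", "--host", src["host"], src["db"]],
                stdout=out,
                check=True,
            )
        # Replay the dump into the warehouse under a per-source schema
        # (the schema/database is assumed to already exist).
        with open(dump_file) as dump:
            subprocess.run(
                ["mysql", "--host", WAREHOUSE_HOST, src["schema"]],
                stdin=dump,
                check=True,
            )

    if __name__ == "__main__":
        for source in SOURCES:
            dump_and_load(source)

This is fine at small scale; the main caveat is that full re-dumps grow linearly with table size, which is where replication or CDC (as discussed below) starts to pay off.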
What is the scale of data you're working with? How many analysts will be querying it?

At a high level I'd probably do something like this:

CDC (Debezium) on a read replica of the external sources (or the primary if a replica doesn't exist)
-> Kafka/Redpanda (optional, since Debezium can write directly to a destination table, but Kafka makes things a bit more flexible, though it comes with its own issues)
-> destination table: this is the load part of ELT; just load the changes into a staging table in batches. Great Expectations can be useful here to make sure things are in line with what you're expecting.
-> SQL + dbt to do transformations and enrichment
-> load into your star-schema'd DB from the staging tables, rinse and repeat.

Oh, and schedule all of this in Airflow or Prefect (see the sketch below).

I'd consider something like ClickHouse, or Percona or Citus on the Postgres side, to get columnar semantics.

You could forgo the whole DB idea and do the data lakehouse (sic) using S3 + Parquet + Trino and a list of other Apache projects to basically reinvent the database wheel. You'd get a ton of autonomy and the ability to scale up individual parts as you need, just with a ton of additional complexity.
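To make the orchestration step concrete, here's a minimal Airflow 2.x sketch of the load -> validate -> transform loop above. The load script path, the Great Expectations checkpoint name, and the dbt project directory are all hypothetical, and it assumes the Debezium/Kafka side is already delivering changes that the load script can drain into staging tables.

    # Minimal Airflow 2.x DAG sketch for the ELT loop described above.
    # Paths, the checkpoint name, and the dbt project dir are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="warehouse_elt",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",  # batch up CDC changes hourly
        catchup=False,
    ) as dag:
        # Load: drain the CDC topic(s) into staging tables (hypothetical script).
        load_staging = BashOperator(
            task_id="load_staging",
            bash_command="python /opt/etl/consume_cdc_to_staging.py",
        )

        # Validate: run a Great Expectations checkpoint against staging.
        validate_staging = BashOperator(
            task_id="validate_staging",
            bash_command="great_expectations checkpoint run staging_checkpoint",
        )

        # Transform: dbt builds the star-schema models from the staging tables.
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/etl/dbt",
        )

        load_staging >> validate_staging >> dbt_run

Keeping load, validation, and transformation as separate tasks means a failed Great Expectations checkpoint blocks the dbt run, so bad batches never reach the star schema and can be retried on their own.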