Ask HN: What is your minimal Data Warehouse stack?

1 point by herodoturtle, about 2 years ago
Hi all.

We want to set up a Data Warehouse. We have several external data sources, all relational: most of them are MySQL databases, and a couple more are Postgres.

We'd like to set up and maintain a single MySQL / Postgres "Data Warehouse" database that houses all this data, so that our analytics team has a single place to access it from.

If you've done something similar, could you please share your experience and / or advice?

Any extra info on how you manage your data pipelines would be appreciated. Currently we're just looking at setting up some basic cron jobs that run bash scripts which in turn execute mysqldumps, but we'll also set up replication in cases where live data is important.

Thanks! :-)
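For illustration, the cron-plus-dump approach described above might look roughly like the following. This is a minimal sketch, not the poster's actual setup: the `Source` record, the host and database names, and the output directory are all hypothetical, and credentials are assumed to come from `~/.my.cnf` or `~/.pgpass` rather than the command line. It only builds the `mysqldump` / `pg_dump` command for each source; running it is left to `subprocess.run`.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class Source:
    """One external database to dump (all field values are hypothetical)."""
    name: str      # label used in the dump file name
    engine: str    # "mysql" or "postgres"
    host: str
    user: str
    database: str


def build_dump_command(src: Source, out_dir: Path, day: date) -> list[str]:
    """Return the argv for a logical dump of one source database."""
    out_file = out_dir / f"{src.name}_{day.isoformat()}.sql"
    if src.engine == "mysql":
        return [
            "mysqldump",
            "--single-transaction",  # consistent InnoDB snapshot without table locks
            f"--host={src.host}",
            f"--user={src.user}",
            f"--result-file={out_file}",
            src.database,
        ]
    if src.engine == "postgres":
        return [
            "pg_dump",
            f"--host={src.host}",
            f"--username={src.user}",
            f"--file={out_file}",
            src.database,
        ]
    raise ValueError(f"unsupported engine: {src.engine}")


if __name__ == "__main__":
    # Hypothetical sources; a real job would pass each argv to
    # subprocess.run(cmd, check=True) instead of printing it.
    sources = [
        Source("orders", "mysql", "db1.internal", "dump", "shop"),
        Source("billing", "postgres", "db2.internal", "dump", "billing"),
    ]
    for src in sources:
        print(" ".join(build_dump_command(src, Path("/var/dumps"), date.today())))
```

A nightly crontab entry such as `0 2 * * * python3 /opt/etl/dump_sources.py` (path hypothetical) would then schedule it. Full dumps like this are simple but re-copy everything on each run, which is one reason replication or CDC tends to come up once tables get large.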

1 comment

gigatexal, about 2 years ago
What is the scale of data that you're working with? How many analysts will be querying this data?

At a high level I'd probably do something like this:

CDC (Debezium) on a read replica of the external sources (or the primary if a replica doesn't exist) -> Kafka/Redpanda (optional, since Debezium can write directly to a destination table, but Kafka makes things a bit more flexible, though it comes with its own issues) -> destination table (this is the load part of ELT: just load the changes in batches into a staging table) -> Great Expectations can be useful here to make sure the data is in line with what you expect -> SQL + dbt to do transformations and enrichment -> load into your star-schema'd DB from the staging tables, rinse and repeat. Oh, and schedule all of this in Airflow or Prefect.

I'd consider something like ClickHouse or Percona or Citus on the Postgres side to get columnar semantics.

You could forgo the whole DB idea and do the data lakehouse (sic) using S3 + Parquet + Trino and a list of other Apache projects to basically reinvent the database wheel, but you'd get a ton of autonomy and the ability to scale up parts as you need, just with a ton of additional complexity.
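To make the "load changes into a staging table, then merge" step concrete, here is a small sketch using Python's built-in `sqlite3` as a stand-in for the warehouse. Everything in it is an assumption for illustration: the table names `staging_customers` and `dim_customers`, the columns, and the sample events. The single-letter operation codes are loosely modelled on the create / update / delete codes used in Debezium change events. A real pipeline would more likely do the merge in SQL via dbt, so treat this as a toy version of the idea rather than the commenter's implementation.

```python
import sqlite3

# Hypothetical change events: (op, primary key, name, balance),
# where op is "c" (create), "u" (update) or "d" (delete).
CHANGES = [
    ("c", 1, "alice", 10.0),
    ("c", 2, "bob", 20.0),
    ("u", 1, "alice", 15.0),
    ("d", 2, None, None),
]


def make_warehouse() -> sqlite3.Connection:
    """Create an in-memory database with a staging table and a target table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staging_customers (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- arrival order of the change
            op TEXT NOT NULL,
            id INTEGER NOT NULL,
            name TEXT,
            balance REAL
        );
        CREATE TABLE dim_customers (
            id INTEGER PRIMARY KEY,
            name TEXT,
            balance REAL
        );
        """
    )
    return conn


def load_batch(conn: sqlite3.Connection, batch) -> None:
    """Load step: append raw change events to the staging table, untransformed."""
    conn.executemany(
        "INSERT INTO staging_customers (op, id, name, balance) VALUES (?, ?, ?, ?)",
        batch,
    )


def merge_staging(conn: sqlite3.Connection) -> None:
    """Apply the latest staged change per key to the target, then clear staging."""
    latest = conn.execute(
        """
        SELECT s.op, s.id, s.name, s.balance
        FROM staging_customers AS s
        WHERE s.seq = (
            SELECT MAX(t.seq) FROM staging_customers AS t WHERE t.id = s.id
        )
        """
    ).fetchall()
    for op, row_id, name, balance in latest:
        if op == "d":
            conn.execute("DELETE FROM dim_customers WHERE id = ?", (row_id,))
        else:
            # INSERT OR REPLACE acts as an upsert on the primary key.
            conn.execute(
                "INSERT OR REPLACE INTO dim_customers (id, name, balance) "
                "VALUES (?, ?, ?)",
                (row_id, name, balance),
            )
    conn.execute("DELETE FROM staging_customers")
    conn.commit()


if __name__ == "__main__":
    conn = make_warehouse()
    load_batch(conn, CHANGES)
    merge_staging(conn)
    print(conn.execute("SELECT * FROM dim_customers ORDER BY id").fetchall())
```

Keeping the load step dumb and doing the deduplication in the merge means a batch can be replayed without corrupting the target table, which is part of why the staging table sits between the change stream and the star schema.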