
We built our customer data warehouse all on Postgres

234 points by samaysharma | over 1 year ago

17 comments

aurareturn over 1 year ago
Postgres is great, but using pg_cron as your system to pull and clean data is a recipe for disaster in the long run. I've always regretted using complicated Postgres queries or functions instead of doing the work in the application server. Sometimes I do it because I think "wow, I didn't know pg can do this. So cool, let's just use it!". Regret. Every time. They're too finicky, hard to test, and don't have an ecosystem like git and IDEs to make changes easy. And the next employee won't know about some random but crucial logic that lives inside Postgres instead of the code.

You're also relying on your Postgres instance's precious CPU/RAM/network to do logic work, which can be unpredictable. Let's be honest, most managed cloud Postgres providers already charge an arm and a leg for big Postgres instances. Why use those expensive resources when you can get a cheap EC2 instance or a serverless function to do the same work and have the ability to scale efficiently?

My general rule is: use Postgres to store as much of my data as possible, but never use Postgres to do anything other than be a data store.

This rule of mine is why I never fell in love with Supabase even though many developers have. Use Postgres as my permissions layer? As my API server? You have to be kidding. If you're building a quick throwaway prototype, maybe. (The rest of Supabase is really nice, though.)

PS. To the team at Tembo: no one is going to sign up for a free trial without a pricing page.
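For readers who haven't used it, this is the kind of in-database scheduling the comment warns about. A minimal sketch, assuming the pg_cron extension is installed and configured, with a hypothetical raw_events table:

```sql
-- Enable the scheduler (requires shared_preload_libraries = 'pg_cron').
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Run a cleanup inside the database every night at 03:00.
-- The job competes with application queries for the instance's CPU and I/O,
-- and its logic lives outside the application codebase: both points above.
SELECT cron.schedule(
  'nightly-clean',
  '0 3 * * *',
  $$DELETE FROM raw_events WHERE created_at < now() - interval '30 days'$$
);
```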
kdamica over 1 year ago
I would not recommend this approach to anyone, for these reasons:

- If your data is small enough that Postgres works as an analytics warehouse, then the cost of running a warehouse like BigQuery is going to be small too, and without the complexity of managing the infra yourself. The company here specializes in Postgres, so it makes sense for them to do this, but for everyone else it makes much more sense to use a warehouse so you can focus on building your product and not worry about scaling your analytics infra.

- Because Postgres doesn't scale for analytics, they only keep 30 days of metrics. I don't think this would be acceptable for most businesses. BigQuery costs 2 cents per GiB per month, so storage is generally a minimal expense until you get to pretty large scale.

- Tools like Fivetran and Airbyte are much simpler for maintaining these pipelines than setting up FDWs, and each supports a huge number of data sources. If you're worried about cost, you can self-host Airbyte or use Stitch (which is kind of a budget player in the space). It's true that Fivetran can get very expensive as you scale, but there is enough competition now that you have options to reduce cost.

Disclaimer: I run a data consulting company that specializes in zero-to-one data projects for startups, including setting up analytics warehouses.
tayloramurphy over 1 year ago
The first data warehouse at GitLab was Postgres! It worked quite well until we started pulling Snowplow event data.

I share some of the other commenters' concerns about having Postgres do *everything*. Having a separation of concerns around orchestration in particular seems wise. That said, getting something up and running using just pg makes sense, and it's always possible to refactor.
draven over 1 year ago
I'm always amazed (and a bit frightened) by the amount of logic that can be implemented directly in the database. How is that code debugged / managed / versioned / deployed? I would be thankful for any pointers to books / blog posts about that.
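One common answer (a sketch of the pattern, not a prescription; the file layout and the clean_events function here are assumptions): keep every function in a plain SQL migration file under version control, and have CI apply migrations to a throwaway database where the functions can be exercised like any other code.

```sql
-- migrations/0042_create_clean_events.sql
-- Reviewed, diffed, and deployed like application code.
CREATE OR REPLACE FUNCTION clean_events(retention interval)
RETURNS bigint
LANGUAGE sql
AS $$
  WITH deleted AS (
    DELETE FROM events
    WHERE created_at < now() - retention
    RETURNING 1
  )
  SELECT count(*) FROM deleted;
$$;
```

Since CREATE OR REPLACE is idempotent, re-running the migration is safe, and the function body shows up in ordinary code review rather than living only inside the database.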
osigurdson over 1 year ago
> This becomes a huge cost in the form of cognitive overhead for the team in addition to the time and resources it takes to set up, manage and maintain it

The problem with using one system that you know well for everything is that it becomes necessary to build (usually suboptimal) systems that emulate well-known ones. In the end you trade one snowflake for another.

That being said, there really ought to be standardized approaches to auth, backup, and operations, as various services today do things very differently, and this does indeed increase cognitive load.
simonhaven over 1 year ago
I wish you well with FDW. I thought postgres_fdw seemed great and tried using it to construct cross-database queries. Unfortunately it performed like a complete dog for any kind of batch DML or mildly complex batch SELECT statement, despite me trying every possible trick I could find online. Apparently it's a known drawback with postgres_fdw, supposedly fixed in later versions (which my organisation wasn't ready to upgrade to). Finding much pertinent information about that particular problem, either in the PG docs or online, was a challenge in itself.

It turned out to be far quicker, if less than ideal, to have a program with a connection to each database act as middleman, because the queries it ran were locally optimised and could be tuned properly.

All this may, of course, be a sign of a major gap in my Postgres-fu, and I accept all derisory replies with appropriate humility.
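For anyone who hasn't set it up, this is the basic postgres_fdw wiring being described: a minimal sketch with hypothetical host, database, and schema names.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote database; fetch_size controls how many rows each
-- remote cursor fetch pulls back (a common performance knob).
CREATE SERVER upstream
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'upstream.internal', dbname 'app', fetch_size '10000');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER upstream
  OPTIONS (user 'readonly', password 'secret');

-- Mirror the remote schema as foreign tables in a local schema.
CREATE SCHEMA IF NOT EXISTS upstream_app;
IMPORT FOREIGN SCHEMA public FROM SERVER upstream INTO upstream_app;
```

The performance complaints above usually come down to which predicates and joins get pushed down to the remote server; EXPLAIN VERBOSE on a cross-database query shows what SQL is actually shipped.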
fourfun over 1 year ago
I definitely agree that using a tool like Fivetran is generally needless complexity when FDWs would get the job done.
madjam002 over 1 year ago
One thing that I think would be awesome is a Postgres extension for streaming queries, where you could join tables with millions of rows and be notified of the delta changes whenever any of the upstream data changes.

I tried Materialize but it seemed incredibly wasteful to duplicate terabytes of data, and it was also extremely slow compared to the same query running in Postgres.

For now I've rolled my own using Debezium CDC and simply reevaluating entire queries with joins when any data changes, but having something smarter would be ideal.
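Postgres does ship a primitive that covers part of this: change notifications, though not incrementally maintained joins. A minimal sketch using LISTEN/NOTIFY with a trigger (the orders table and channel name are hypothetical):

```sql
-- Fire a JSON payload on the 'orders_changed' channel for every write.
-- Note: NOTIFY payloads are capped at roughly 8 kB.
CREATE OR REPLACE FUNCTION notify_orders_changed()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
  PERFORM pg_notify('orders_changed', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$;

CREATE TRIGGER orders_changed_trg
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION notify_orders_changed();

-- Any client that has run `LISTEN orders_changed;` receives each payload.
```

This only tells you that a row changed; recomputing what that does to a multi-table join is exactly the incremental-view-maintenance problem that Materialize (and extensions like pg_ivm) try to solve.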
infecto over 1 year ago
The website does a terrible job of communicating what they are. I had to click through to the docs to get even a quick summary.

In the context of their business I guess this writeup makes sense; it is, after all, a marketing piece for their service. I don't agree with it, though. I get the simplicity of running everything in a single db stack, but I don't think any of the tools they listed were really going to be required to build something like this into another data warehouse.
wokwokwok over 1 year ago
> We were able to build a data warehouse that is easy to maintain, and easy to reason about, and quick to onboard new engineers.

A data warehouse?

Ingesting raw data != a data warehouse.

I presume their cron jobs are significantly more complicated than that, implementing their domain logic in functions, which will not scale well into an easy-to-maintain system over time, as all database function implementations do… or, what they've built is something…

…but that thing is not a data warehouse. It's just a database.
anymouse123456 over 1 year ago
I was looking for a way to run migrations across SQLite and Postgres without requiring an unrelated language ecosystem (e.g., Node, Python, Ruby, Java, etc.), and discovered Shmig.

It's pretty great.

https://github.com/mbucc/shmig

Would love to see DB unit-test infra with the same values.
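For a flavor of the tool: shmig migrations are plain SQL files with up and down sections. The marker syntax below is from my reading of the repo's README (verify there); the table and SQL are hypothetical.

```sql
-- 1617290000-create-users.sql
-- ==== UP ====
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  email TEXT NOT NULL UNIQUE
);

-- ==== DOWN ====
DROP TABLE users;
```

Because the files are plain SQL and the tool itself is a single shell script, the same migration set can target SQLite, Postgres, or MySQL without dragging in another language runtime.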
c0balt over 1 year ago
FYI, the pricing page[0] is unnecessarily hidden (4-5 clicks). Please make it part of the top menu or show it with the free-trial screen.

[0]: https://tembo.io/docs/tembo-cloud/billing
hellcow over 1 year ago
Data warehouse use cases in Postgres for me generally rely on calculating counts/stats over time, and estimates are sufficient. I really like HyperLogLog for this use case: fast performance, low storage use, and there's a great Postgres extension that's widely supported.
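A minimal sketch of that pattern with the citusdata postgresql-hll extension (the extension's functions are real; the events table and column names are hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS hll;

-- One small fixed-size sketch per day instead of millions of raw rows.
CREATE TABLE daily_uniques (
  day   date PRIMARY KEY,
  users hll  NOT NULL
);

INSERT INTO daily_uniques
SELECT created_at::date, hll_add_agg(hll_hash_bigint(user_id))
FROM events
GROUP BY 1;

-- Estimated unique users per day...
SELECT day, hll_cardinality(users) FROM daily_uniques ORDER BY day;

-- ...and across any window, since sketches union cheaply.
SELECT hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE day >= date '2024-01-01';
```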
debarshri over 1 year ago
Netezza's DWH appliance was built on top of Postgres [1].

Building a data warehouse on top of Postgres is not something that's new. It is definitely a lost art, though.

[1] https://en.wikipedia.org/wiki/Netezza
efxhoy over 1 year ago
I love Postgres too. We went from a brittle event-sourcing architecture to postgres_fdw; it works really well. We do something similar with a few differences:

We swap entire schemas on every run (one schema being one upstream database or collection of analyses) atomically, instead of deleting data and then inserting data. This means anyone querying the DW never sees an empty table: it's yesterday's data until, instantly, it's today's data, consistently for the entire source. If something breaks, the data is just stale, never missing.

We don't do long-running SELECTs from foreign tables, because a long-running SELECT of a huge table can take many minutes. If an upstream database deploys a migration which changes the database schema, that migration could time out and stall the deploy while waiting for the SELECT to finish so it can acquire its ACCESS EXCLUSIVE lock. So we never hold even ACCESS SHARE locks for more than a couple of seconds. Basically the agreement is "the DW should never interrupt the operation of the upstream databases". We do this with a custom batched_create_table_from procedure. For integer-id tables it's trivial. For uuid tables we fetch all the uuids in one go into a temp table with an integer id and then get chunks of uuid values at a time. This works well because we're always sending indexed id values to the foreign server, so the foreign server can use its primary-key indexes to give us only the data we want for that chunk, without scanning the entire table. It's a bit of a hack, and I wish postgres_fdw's fetch_size parameter would release locks between fetches. Of course that would break atomicity, so I doubt it's going to happen.

We run an external cron schedule which launches an ECS task that runs a bash script with our SQL files that define the DW's schema. It's a really primitive DAG with bash & and wait. No scheduling happens in the db itself. This means we can schedule, check logs, monitor, etc. like all our other ECS tasks. Some bash magic lets us define basically a DAG, though we run blocks of independent tasks (SQL files) instead of building a proper dependency tree between individual tasks.

I learned about ctid recently; that could have worked for the chunking.

We're currently moving to BigQuery and dbt for a few reasons, primarily to be able to handle silly amounts of user-interaction data, where the experience for analysts is just much better in BigQuery. dbt is also really nice, though I do miss the simplicity of just SQL files. The development setup for the old DW relied on rebuilding it locally on a developer machine, which eventually ran into storage limitations and people getting lazy and not rebuilding from scratch before submitting their PRs; we just never got around to building proper CI for it. Rebuilding our Postgres DW from scratch takes a few hours now; the same logic in BigQuery runs in 10 minutes, which is undeniably amazing.
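The atomic schema swap described in the first point can be done with two renames in one transaction; a minimal sketch, with analytics and analytics_next as hypothetical schema names:

```sql
-- Build the new snapshot off to the side.
CREATE SCHEMA analytics_next;
-- ... create and fill tables in analytics_next ...

-- Swap: readers see the old schema until this commits, then the new one.
BEGIN;
ALTER SCHEMA analytics RENAME TO analytics_old;
ALTER SCHEMA analytics_next RENAME TO analytics;
COMMIT;

DROP SCHEMA analytics_old CASCADE;
```

Because the swap is a single transaction, a query resolves either entirely against the old snapshot or entirely against the new one, which is what guarantees "stale, never missing".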
jwmoz over 1 year ago
By "customer data warehouse" I presume they mean "customer database"?
ramesh31 over 1 year ago
Sounds expensive.