We built a modern data stack from scratch and reduced our bill by 70%

83 points by jchandra 3 months ago

17 comments

This just seems like over-engineered solutions trying to guarantee job security. When the dataflows are this straightforward, just replicate into your OLAP of choice and transform there.

1a527dd5 3 months ago

There is something here that doesn't sit right.

We use BQ and Metabase heavily at work. Our BQ analytics pipeline is several hundred TBs. In the beginning we had data (engineer|analyst|person) run amok and run up a BQ bill of around 4,000 per month.

By far the biggest things were:

- partition key was optional -> fix: required
- bypass the BQ caching layer -> fix: make queries use deterministic inputs [2]

It took a few weeks to go through each query using the metadata tables [1], but it was worth it. In the end our BQ analysis pricing was down to something like 10 per day.

[1] https://cloud.google.com/bigquery/docs/information-schema-jobs-timeline#jobs_timeline-view
[2] https://cloud.google.com/bigquery/docs/cached-results#cache-exceptions
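
As a rough illustration of the "deterministic inputs" fix, the sketch below computes the date on the client and embeds it as a literal, so the query text stays identical for the whole day and contains no non-deterministic function such as CURRENT_DATE(). It also filters on the partition column. The table name `proj.ds.events`, the column `event_date`, and both helper functions are hypothetical and not taken from the comment.

```python
from datetime import date, datetime, timezone
from typing import Optional


def today_utc(now: Optional[datetime] = None) -> date:
    """Round 'now' down to a UTC calendar day so every run that day agrees."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).date()


def build_daily_query(table: str, partition_column: str, day: date) -> str:
    """Build query text that is byte-for-byte identical for a given day.

    The date is a literal rather than a call to CURRENT_DATE(), and the
    WHERE clause always filters on the partition column.
    """
    return (
        f"SELECT COUNT(*) AS events "
        f"FROM `{table}` "
        f"WHERE {partition_column} = DATE '{day.isoformat()}'"
    )


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(build_daily_query("proj.ds.events", "event_date", today_utc()))
```

Two dashboards that refresh at different times on the same day would then submit the same query text, which is one of the conditions for a cached result to be reused.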

SkyPuncher 3 months ago

I know it's easy to be critical, but I'm having trouble seeing the ROI on this.

This is a $20k/year savings. Perhaps I'm not aware of the pricing in the Indian market (where this startup is), but that simply doesn't seem like a good use of time. There's an actual cost to doing these implementations, both in hard dollars (salaries of the people doing the work) and in the trade-offs of deprioritizing other work.

bob1029 3 months ago

When working with ETL, it really helps to not conflate the letters or worry about them in the wrong order. A lot of the most insane complexity comes out of moving too quickly with data.

If you don't have good staging data after running extraction (i.e., a 1:1 view of the source system data available in your database), there is nothing you can do to help with this downstream. You should stop right there and keep digging.

Extracting the data should be the most challenging aspect of an ETL pipeline. It can make a lot of sense to write custom software to handle this part. It is worth the investment because if you do the extraction really well, the transform & load stages can happen as a combined afterthought [0,1,2,3] in many situations.

This also tends to be one of the fastest ways to deal with gigantic amounts of data. If you are doing things like pulling 2 different tables and joining them in code as part of your T/L stages, you are really missing out on the power of views, CTEs, TVFs, merge statements, etc.

[0] https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql
[1] https://www.postgresql.org/docs/current/sql-merge.html
[2] https://docs.oracle.com/database/121/SQLRF/statements_9017.htm
[3] https://www.ibm.com/docs/en/db2/12.1?topic=statements-merge
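
A minimal sketch of the "stage 1:1, then transform and load in one statement" idea, using Python's built-in sqlite3 module so it runs anywhere. SQLite has no MERGE statement, so an INSERT ... ON CONFLICT DO UPDATE upsert stands in for it here. The tables `stg_orders` and `orders`, their columns, and the `load_and_merge` function are hypothetical.

```python
import sqlite3


def load_and_merge(conn: sqlite3.Connection, source_rows) -> None:
    """Stage source rows unchanged, then transform and load in a single statement."""
    cur = conn.cursor()
    # Staging table mirrors the source 1:1 (no transformation yet).
    cur.execute(
        "CREATE TABLE IF NOT EXISTS stg_orders "
        "(id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    cur.execute("DELETE FROM stg_orders")
    cur.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

    # Transform + load together: trim names, convert cents to units,
    # insert new ids and update existing ones.
    # "WHERE 1" is needed so SQLite can parse ON CONFLICT after a SELECT.
    cur.execute(
        """
        INSERT INTO orders (id, customer, amount)
        SELECT id, TRIM(customer), amount_cents / 100.0
        FROM stg_orders
        WHERE 1
        ON CONFLICT(id) DO UPDATE SET
            customer = excluded.customer,
            amount = excluded.amount
        """
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_and_merge(conn, [(1, " ann ", 1050), (2, "bob", 200)])
    load_and_merge(conn, [(2, "bob", 300), (3, "cy", 500)])
    print(conn.execute("SELECT * FROM orders ORDER BY id").fetchall())
```

On a database that supports MERGE (see the links above), the upsert would be replaced by a MERGE from the staging table into the target.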

cratermoon 3 months ago

AKA The Monty Hall Rewrite: https://alexsexton.com/blog/2014/11/the-monty-hall-rewrite

ripped_britches 3 months ago

So you saved just $20k per year? Not sure of the context of your company, but I’m not sure this turns out to be a net win given the cost of the engineering resources needed to produce this infra gain.

65 3 months ago

Why is data engineering so complicated?

I'm not a data engineer but was tasked with building an ETL pipeline for a large company. It's all just Step Functions, looping through file streams in a Lambda, transforming, then putting the data into Snowflake for the analytics team to view. My pipeline processes billions of rows from many different sources. Each pipeline runs daily on a cron job (maybe that's the key differentiator: we don't need live streaming data, it's a lot of point-of-sale data).

Whenever I hear actual data engineers talk about pipelines, there are always a million different tools and complicated-sounding processes. What am I missing?
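
The shape described here (stream a file, transform each row, load in batches) can be sketched in a few lines. This is only an illustration under assumptions: the columns `store_id` and `amount` are hypothetical, and `load` is a placeholder callable standing in for whatever writes to the warehouse. No AWS or Snowflake API is used.

```python
import csv
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def transform(row: Dict[str, str]) -> Row:
    """Clean one raw CSV row: strip whitespace and parse the amount."""
    return {
        "store_id": row["store_id"].strip(),
        "amount": float(row["amount"]),
    }


def run_pipeline(
    stream: Iterable[str],
    load: Callable[[List[Row]], None],
    batch_size: int = 1000,
) -> int:
    """Read CSV lines lazily, transform them, and hand batches to `load`.

    Returns the number of rows loaded. Only one batch is held in memory.
    """
    reader = csv.DictReader(stream)
    batch: List[Row] = []
    total = 0
    for raw in reader:
        batch.append(transform(raw))
        if len(batch) >= batch_size:
            load(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        load(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    lines = ["store_id,amount\n", " a1 ,10.50\n", "b2,3\n"]
    print(run_pipeline(lines, load=print, batch_size=1))
```

Because `stream` is any iterable of text lines, the same function works on an open file object or on a line-by-line reader over a downloaded object.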

jchandra 3 months ago

We did have a discussion on self-managed vs. managed and the TCO associated with each.

1> We have a multi-regional setup, so data sovereignty requirements came up.
2> Vendor lock-in: a few of the services were not available in that geographic region.
3> With managed services, you often pay for capacity you might not always use. Our workloads were often consistent and predictable, so self-managed solutions helped us fine-tune our resources.
4> One of the goals was to keep our storage and compute loosely coupled while staying Iceberg-compatible for flexibility. Whether it’s Trino today or Snowflake/Databricks tomorrow, we aren’t locked in.

thecleaner 3 months ago

This is a little bit of a word soup. It's hard to see why the various redesigns were done without a set of requirements. I don't get why you'd trigger Airflow workflows for doing CDC. These things were designed for large-scale batch jobs rather than doing CDC on some Google Sheets. Either way, without scale numbers it's hard to see why PG was used or why the shift to BigQuery. Anyway, the site uses Hugo, which actually sticks out for me.

reillyse 3 months ago

How much did this cost in engineering time and how much will it cost to maintain? How about when you need to add a new feature? Seems like you saved roughly 1.5k per month, which pays for a couple of days of engineering time (ignoring product, mgmt, and costs related to maintaining the software).
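
The break-even question can be made concrete with simple arithmetic: months to pay back = build cost / (monthly savings - monthly maintenance). The sketch below uses the roughly 1.5k per month savings figure from this comment. The build cost and maintenance figures are hypothetical placeholders, since the thread does not state them.

```python
from typing import Optional


def payback_months(
    build_cost: float,
    monthly_savings: float,
    monthly_maintenance: float = 0.0,
) -> Optional[float]:
    """Months until cumulative net savings cover the one-off build cost.

    Returns None when maintenance eats all of the savings, i.e. the
    project never pays for itself.
    """
    net = monthly_savings - monthly_maintenance
    if net <= 0:
        return None
    return build_cost / net


if __name__ == "__main__":
    # 1500 is the monthly savings quoted above; 30000 and 500 are
    # made-up inputs for illustration only.
    print(payback_months(30000, 1500))       # no maintenance cost
    print(payback_months(30000, 1500, 500))  # with ongoing maintenance
```

Plugging in real salary and maintenance numbers is what decides whether the savings are a net win.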

vivahir215 3 months ago

Good read.

I do have a question about BigQuery. If you were experiencing unpredictable query costs or customization issues, that sounds like user error. There are ways to optimize or commit slots to reduce the cost. Did you try that?

tacker2000 3 months ago

Is Debezium the only good CDC tool out there? I have a fairly simple data stack and am looking at integrating a CDC solution, but I really don't want to touch Kafka just for this. Are there any easier alternatives?

rockwotj 3 months ago

Why Confluent instead of something like MSK, Redpanda, or one of the new leaderless, direct-to-S3 Kafka implementations?

slake 3 months ago

What's the database that dbt is connected to in this scenario?

mosselman 3 months ago

You’d think that pushing all of the data into any OLAP database, especially some of the newer Postgres-based ones, would give you all the performance you need at 10% of the cost? Let alone all the maintenance of the mind-boggling architecture drawing.

throwaway7783 3 months ago

.. how many engineers?

moandcompany 3 months ago

> "We are a fintech startup helping SMEs raise capital from our platform where we provide diverse financial products ranging from Term Loan, Revenue Based Financing to Syndication, we face one unique data challenge: Our data comes from everywhere."

Reading carefully: the result of this work yields an expected $21,000 USD in annual operating cost savings for infrastructure services.

Is this resume-driven development?

What was the opportunity cost of this work? Is the resulting system more or less maintainable by future employees/teammates?