The number of things I've seen explode in cost when using Beam and its managed variants is insane.<p>The general technique of performing ETL by streaming saved data to a compute resource, writing and running a program in the company's lingua franca, and then loading it back is nearly always inefficient. This article underlines just how impactful minor issues with node sizing can be - and it's something to stick into your calendar to revisit every six months (and after data and program changes).<p>We don't get the full context here, but it's generally more cost-efficient to keep the data in a real database (e.g. BigQuery) and operate on it using SQL as much as possible. You can perform in-database ETL by loading to different tables and transforming there. For some tasks you will want UDFs, and in rarer instances you will need external ETL - but if those instances aren't first powered by a non-trivial internal query, I would be very concerned!<p>One of the main reasons teams don't store data in a database is that its structure is currently considered incompatible, or they see big challenges with partial and duplicate storage. Another reason is issues around data loading - ingest can be <i>very</i> expensive or <i>very</i> cheap depending on exactly how you do it!<p>A final note: the article was written on the 20th of June, and it's been a while since then. It would be great to know the real impact rather than the estimate!<p>> Presented figures are only estimates based on a single run (with only 3% of input data) and extrapolated to the whole year with the assumption that processing the whole dataset will result in the same relative savings as processing 3% of source data.
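<p>To illustrate the staging-table pattern I mean, here is a minimal sketch using Python's built-in sqlite3 purely as a stand-in for a warehouse like BigQuery (table and column names are made up for the example):

```python
import sqlite3

# In-memory database stands in for the warehouse (e.g. BigQuery).
conn = sqlite3.connect(":memory:")

# 1. Load raw data into a staging table, untransformed.
conn.execute("CREATE TABLE staging_events (user_id TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO staging_events VALUES (?, ?)",
    [("a", 100), ("a", 250), ("b", 40)],
)

# 2. Transform in-database with plain SQL, instead of streaming the rows
#    out to external compute and loading the results back.
conn.execute(
    """
    CREATE TABLE daily_totals AS
    SELECT user_id, SUM(amount_cents) AS total_cents
    FROM staging_events
    GROUP BY user_id
    """
)

print(dict(conn.execute("SELECT user_id, total_cents FROM daily_totals")))
# {'a': 350, 'b': 40}
```

The data never leaves the database engine; external ETL only enters the picture when a transformation genuinely cannot be expressed as a query or UDF.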
Would efficiency improve if they were to use larger machines? Same CPU/memory ratio, but more CPUs and memory? Assuming they have more than ~20 VMs for this...<p>It would also be interesting to know whether they could get away with a single very large machine instead, like h3-standard-88... From a cost perspective it does not seem too far off from their final solution, which is why I assume a single VM might handle the load.
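<p>A quick back-of-the-envelope script for that comparison (the hourly prices, fleet size, and runtime below are made-up placeholders, not quoted GCP rates - plug in current on-demand pricing for your region before drawing any conclusions):

```python
# Compare the per-run cost of a fleet of small worker VMs against one
# large VM. All numbers here are HYPOTHETICAL assumptions for illustration.

FLEET_VM_HOURLY = 0.25   # assumed $/hour for one small worker VM
FLEET_SIZE = 20          # the ~20-VM fleet from the scenario above
BIG_VM_HOURLY = 4.00     # assumed $/hour for one large VM (e.g. h3-standard-88)
JOB_HOURS = 6            # assumed runtime of one batch job

fleet_cost = FLEET_VM_HOURLY * FLEET_SIZE * JOB_HOURS
big_vm_cost = BIG_VM_HOURLY * JOB_HOURS

print(f"fleet of {FLEET_SIZE}: ${fleet_cost:.2f} per run")   # $30.00 per run
print(f"single large VM:  ${big_vm_cost:.2f} per run")       # $24.00 per run
```

Under these made-up numbers the single machine comes out slightly cheaper, which matches the intuition above - though it trades away elasticity and leaves no headroom if the dataset outgrows one box.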