Reducing the Cost of a Single Google Cloud Dataflow Pipeline by Over 60%

9 points by jimminyx 6 months ago

2 comments

Normal_gaussian 6 months ago
The number of things I've seen explode in cost when using Beam and its managed variants is insane.

The general technique of performing ETL by streaming saved data to a compute resource, writing and running a program in the company's lingua franca, and then loading it back is nearly always inefficient. This article underlines just how impactful minor issues with node sizing can be - and it's something to stick into your calendar to revisit every six months (and after data and program changes).

We don't get the context here, but it's generally more cost efficient to keep the data in a real database (e.g. BigQuery) and operate on it using SQL as much as possible. You can perform in-database ETL by loading to different tables and operating there. For some tasks you will want to use UDFs, and in rarer instances you will need external ETL - but if those instances aren't first powered by a non-trivial internal query I would be very concerned!

One of the main reasons teams don't store data in a database is that its structure is currently considered incompatible, or they see big challenges with partial and duplicate storage. Another reason is issues around data loading - ingest can be *very* expensive or *very* cheap depending on exactly how you do it!

A final note: the article was written on the 20th of June, and it's been a while. It would be great to know the real impact rather than the estimate!

> Presented figures are only estimates based on a single run (with only 3% of input data) and extrapolated to the whole year with the assumption that processing the whole dataset will result in the same relative savings as processing 3% of source data.
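
A minimal sketch of the in-database ETL pattern described above, using the BigQuery Python client (the project id, dataset and table names, and the transform itself are hypothetical): the aggregation runs entirely inside BigQuery via CREATE TABLE AS SELECT, so no data is streamed out to an external compute resource.

    from google.cloud import bigquery

    # Hypothetical project id and table names; the point is that the
    # transform runs inside BigQuery rather than in an external pipeline.
    client = bigquery.Client(project="my-project")

    query = """
    CREATE OR REPLACE TABLE analytics.daily_orders AS
    SELECT
      DATE(order_ts) AS order_date,
      customer_id,
      SUM(amount)    AS total_amount
    FROM raw.orders
    WHERE order_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    GROUP BY order_date, customer_id
    """

    job = client.query(query)  # BigQuery performs the scan and aggregation
    job.result()               # wait for the job to finish

If a step really cannot be expressed in SQL like this, a UDF keeps the work in-database before reaching for external ETL.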
vander_elst 6 months ago
Would efficiency improve if they were to use larger machines? Same CPU/memory ratio, but more CPUs and memory? Assuming they have more than ~20 VMs for this.

It would also be interesting to know if they could get away with a single very large machine instead, like h3-standard-88... From a cost perspective it does not seem too far off from their final solution; that's why I assume a single VM could perhaps handle the load.
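
For reference, worker machine type and worker count are ordinary Dataflow pipeline options, so trying the single-large-VM idea is mostly a configuration change. A rough sketch with the Beam Python SDK (project, bucket, and paths are placeholders, and whether one h3-standard-88 actually handles the load would need to be measured):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project, bucket, and paths; machine type and worker count
    # follow the single-large-VM idea discussed above.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        worker_machine_type="h3-standard-88",  # one large worker
        num_workers=1,
        max_num_workers=1,
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
         | "Parse" >> beam.Map(str.strip)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/output/part"))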