HEPCloud formation: handling troves of high-energy physics data

57 points by seycombi over 8 years ago

3 comments

boulos over 8 years ago
Disclosure: I work on Google Cloud, and helped out with this effort.

We're hoping to provide some more "How Fermilab did it" later (maybe at NEXT?) but our blog post had a little more information [1]. End-to-end from "Let's do this!" to Supercomputing was about 3 weeks (including some bugfixes to the GCE support in HTCondor).

For people asking, Fermilab put up about 500T into a GCS Regional bucket (in us-central1). We have great peering at Google, so I think this was streaming at about 100 Gbps from Fermilab to us.

As surmised, the science that we ran was lots of individual tasks, each one needing about 2 GB of RAM per vCPU, so during Supercomputing we right-sized the VMs using Custom Machine Types [2]. IIRC, several tasks needed up to 20 GB per task, so we sized PD-HDD to 20 GB per vCPU. All of this was read from GCS in parallel via gcsfuse [3]; we chose the Regional bucket for GCS to minimize cost per byte, and to maximize throughput efficiency (no reason to replicate elsewhere for this processing).

All the data after processing went straight back to Fermilab over that pipe. The output data size, though, was pretty small IIRC, and I don't think we were ever much over 10 Gbps on output.

HTCondor was used to submit work from Fermilab onto GCE directly (the submission / schedd boxes were at Fermilab) and spun up Preemptible VMs. We used a mix of custom-32vCPU-64GB in us-central1-b and us-central1-c, as well as custom-16vCPU-32GB in us-central1-a and us-central1-f. You can see a graph of Fermilab's monitoring here [4] when it was all set up. 160k vCPUs for $1400/hr!

[Edit: Newlines, I always forget two newlines]

[1] https://cloudplatform.googleblog.com/2016/11/Google-Cloud-HEPCloud-and-probing-the-nature-of-Nature.html

[2] https://cloud.google.com/custom-machine-types/

[3] https://github.com/GoogleCloudPlatform/gcsfuse

[4] https://twitter.com/googlecloud/status/798293201681457154
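A rough sketch of the arithmetic described above, not Fermilab's or Google's actual tooling: it sizes a custom machine type at 2 GB of RAM and 20 GB of PD-HDD per vCPU, builds an illustrative gcloud command for one preemptible worker, and backs out the implied per-vCPU-hour cost from the 160k vCPUs / $1400/hr figure. The instance names, and the assumption that the per-vCPU scratch space lives on the boot disk, are illustrative only.

    # Reproduce the sizing and cost arithmetic from the comment above.
    # Flag names follow the public gcloud CLI; project, image and instance
    # names are omitted or hypothetical.

    RAM_PER_VCPU_GB = 2        # stated working-set need per vCPU
    SCRATCH_PER_VCPU_GB = 20   # PD-HDD sized per vCPU for the larger tasks

    def worker_shape(vcpus):
        """Return (vcpus, memory_gb, scratch_gb) for a custom machine type."""
        return vcpus, vcpus * RAM_PER_VCPU_GB, vcpus * SCRATCH_PER_VCPU_GB

    def gcloud_create_cmd(name, vcpus, zone):
        # Assumption: scratch PD-HDD modeled as the boot disk for simplicity.
        vcpus, mem_gb, disk_gb = worker_shape(vcpus)
        return (
            f"gcloud compute instances create {name} "
            f"--zone {zone} --preemptible "
            f"--custom-cpu {vcpus} --custom-memory {mem_gb}GB "
            f"--boot-disk-type pd-standard --boot-disk-size {disk_gb}GB"
        )

    # The two shapes mentioned: custom-32vCPU-64GB and custom-16vCPU-32GB.
    print(gcloud_create_cmd("hepcloud-worker-0001", 32, "us-central1-b"))
    print(gcloud_create_cmd("hepcloud-worker-0002", 16, "us-central1-a"))

    # Implied price per vCPU-hour at peak: $1400/hr across 160k vCPUs.
    print(f"~${1400 / 160_000:.4f} per vCPU-hour")   # ~$0.0088

The $1400/hr across 160k vCPUs works out to roughly $0.009 per vCPU-hour; in the actual deployment the VMs were provisioned by HTCondor's GCE support rather than by hand.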
dekhn over 8 years ago
DOE and CERN have done a really nice job with HEP: the jobs can basically flow to any compute backend without a lot of effort. Instead of saying "we have a private cloud" or "we use the public cloud", they just say "we use all the clouds".
Comment #13096854 not loaded
hcrisp over 8 years ago
How does the data get moved efficiently to the spun-up cloud nodes? The article mentions parceling out jobs to nodes with independent storage, but how does that help data-intensive work? Are they doing something besides just copying data to the local storage of the cloud nodes? It seems like that would be a bottleneck.
Comment #13096825 not loaded

Comment #13096465 not loaded

Comment #13100246 not loaded