
Ray: A Distributed Framework for Emerging AI Applications

134 points | by mlerner | almost 4 years ago

12 comments

sillysaurusx · almost 4 years ago

I used Ray to train a massive GPT model by putting each layer on a separate TPU. Ray was able to send all the gradients back and forth as needed.

It scaled fine up to 33 TPUs (i.e. 33 layers).

Ray is impressive as hell.

By the way, I didn't write the code to do any of that. kindiana, aka "the guy that wrote GPT-J", also happened to write this:

https://github.com/kingoflolz/swarm-jax

I just ran it and it worked. Which is extraordinarily unusual for TPUs, historically speaking.

I'm pushing my luck at this point, since it's crossing the line from enthusiasm to spam. But if you want to try out Ray on TPUs like I did here, I posted a (massive) amount of detail on how to get started, and why: https://news.ycombinator.com/item?id=27728225

That's the last I'll be mentioning it for some time, though.

Ray + JAX is such a killer combo.
Rich_Morin · almost 4 years ago

Although it's early days, José Valim and some other folks are working on adding AI-related capabilities to Elixir and the (Erlang) BEAM. See "Introducing Nx" (https://www.youtube.com/watch?v=fPKMmJpAGWc) for an intro.

Given that they already have a robust Actor model as a base to work from, it occurs to me that they may be able to use some of Ray's ideas as they go along...
Tenoke · almost 4 years ago

I've used Ray for about a year (typically for thousands of ML tasks spread across ~48-120 cores simultaneously) and it's a pleasure to use, at least via the basic API. Admittedly, I had problems when trying some of the more advanced approaches, but I didn't really need them, and I can definitely recommend it since the performance is great.
reubenbond · almost 4 years ago

Given how Ray "provides [...] exactly-once semantics" for its actors, you could draw similarities between it and workflow-as-code frameworks such as https://temporal.io. The way that Ray splits up actors and tasks looks similar to Temporal's Workflows + Activities split: Workflows (Ray actors) contain orchestration logic and have their method calls/results durably logged. Activities (Ray tasks) perform the expensive computations and any interaction with external systems and are not durably logged.

If you're in the .NET ecosystem or interested in distributed systems in general, you may like Orleans (https://github.com/dotnet/orleans), which I work on at Microsoft. Orleans contributes the Virtual Actor model, which other modern actor frameworks are starting to adopt since it is well suited for the hectic, failure-prone environment of distributed systems (which those so-called Cloud Native Apps live in). The Ray paper linked from the article (https://www.usenix.org/system/files/osdi18-moritz.pdf) discusses some similarities. Slight correction on the paper: it states that "For message delivery, Orleans provides at-least-once [...] semantics". It's at-most-once. At-least-once messaging semantics (usually implemented via automatic retries) aren't ideal for these kinds of systems, in my opinion.
ramoz · almost 4 years ago

I spent the past year and a half deploying a distributed backend for BERT-like models, and we ultimately chose a K8s architecture with "precise" affinity mapped out, which is still hard due to CPU-pinning issues. On the frontend API, Golang gives us the ability to distribute and split incoming requests (10-20M/day, with batch sizes averaging ~3K that split into 50 due to model constraints). Embeddings are stored on those nodes' local SSDs; those nodes are only a handful. Models run on 2 pools, one dedicated and one preemptible (most nodes here), which gives us cost optimization, and scheduling is simplified thanks to K8s. We have anywhere from 120-300 of these high-compute nodes.

Wondering if anyone has similar deployments and migrated to Ray. We've evaluated it but can't afford a large migration at this point; we would also need to test quite a bit and rebuild our whole automation for infra and apps.

Really interested, though, as the infrastructure isn't cheap, and every time the model updates we are basically re-architecting it. Right now we are moving everything away from Python (gunicorn/Flask, and MKL) to Golang, as we can get better efficiencies with data serialization (numpy ops are the biggest time eaters right now ... model input vectors constructed from flatbuffers)
wolfium3 · almost 4 years ago

There was a recent talk at PyCon US 2021 on this :)

TALK / SangBin Cho / Data Processing on Ray [https://www.youtube.com/watch?v=DNLqvdov_J4]
flakiness · almost 4 years ago

At a glance, Ray is a re-invention (or rebranding) of distributed object systems plus agents, which were popular around '90-'00. Things like Java RMI and CORBA (remember?) were part of the trend, until REST killed them all.

On top of the distributed object foundation, Ray added ML-oriented twists like efficient numerical data transfer with Apache Arrow, and shifted focus from (classic) agent systems to RL and distributed ML training in general, accompanied by a Python-first approach - which simplifies a lot of things compared to traditional, often language-agnostic distributed objects.

I'm not claiming Ray is not novel. Rather, my point is that all a dated idea needs to come back may be just some relevant-today twists like these. I think Ray is a good demonstration of the possibility of such old new things.
blueyes · almost 4 years ago

Our team has used Ray for more than two years across several versions. It makes a lot of things easy that were not, and is especially adapted for our purposes, which include training and deploying a lot of reinforcement learning policies. The Anyscale team is very responsive on the support Slack, fwiw.
phissenschaft · almost 4 years ago

Great work and kudos to the Ray team! It's definitely a fresh look with a lot of lessons learned from previous generations (e.g., Spark).

There are a few nice features I wish Ray would eventually get to.

On the user experience side, it would be nice to have task-level logs: often it's easier for users to reason at the task level, especially when the task is a facade that triggers other complicated library/subprocess calls.

For the scheduler, it would be nice to have more native support for sharded/bundled/partitioned tasks and dynamic work rebalancing along the lines of https://cloud.google.com/blog/products/gcp/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
robertnishihara · almost 4 years ago

Hi all, I'm one of the authors of Ray, thanks for all the comments and discussion! To add to the discussion, I'll mention a few conceptual things that have changed since we wrote the paper.

*Emphasis on the library ecosystem*

A lot of our focus is on building an ecosystem of libraries on top of Ray (much, but not all, of the focus is on machine learning libraries). Some of these libraries are built natively on top of Ray, such as Ray Tune for scaling hyperparameter search (http://tune.io), RLlib for scaling reinforcement learning (http://rllib.io), Ray Serve for scaling model serving (http://rayserve.org/), and RaySGD for scaling training (https://docs.ray.io/en/master/raysgd/raysgd.html).

Some of the libraries are popular libraries on their own, which now integrate with Ray, such as Horovod (https://eng.uber.com/horovod-ray/), XGBoost (https://xgboost.readthedocs.io/en/latest/tutorials/ray.html), and Dask for dataframes (https://docs.ray.io/en/master/dask-on-ray.html). While Dask itself has similarities to Ray (especially the task part of the Ray API), Dask also has libraries for scaling dataframes and arrays, which can be used as part of the Ray ecosystem (more details at https://www.anyscale.com/blog/analyzing-memory-management-and-performance-in-dask-on-ray).

Many Ray users start using Ray for one of the libraries (e.g., to scale training or hyperparameter search) as opposed to just for the core system.

*Emphasis on serverless*

Our goal with Ray is to make distributed computing as easy as possible. To do that, we think the serverless direction, which allows people to just focus on their code and not on infrastructure, is very important. Here, I don't mean serverless purely in the sense of functions-as-a-service, but something that would allow people to run a wide variety of applications (training, data processing, inference, etc.) elastically in the cloud without configuring or thinking about infrastructure. There's a lot of ongoing work here (e.g., to improve autoscaling up and down with heterogeneous resource types). More details on the topic: https://www.anyscale.com/blog/the-ideal-foundation-for-a-general-purpose-serverless-platform

If you're interested in this kind of stuff, consider joining us at Anyscale: https://jobs.lever.co/anyscale
orf · almost 4 years ago
How does this compare to Dask?
maxthegeek1 · almost 4 years ago
Honestly looks really cool, I want to try using Ray.