TechEcho

Hey HN, we wanted to share our repo where we fine-tuned Llama 3.1 on Google TPUs. We’re building AI infra to fine-tune and serve LLMs on non-NVIDIA accelerators (TPUs, Trainium, AMD GPUs).

The problem: Right now, 90% of LLM workloads run on NVIDIA GPUs, but there are equally powerful and more cost-effective alternatives out there. For example, training and serving Llama 3.1 on Google TPUs is about 30% cheaper than on NVIDIA GPUs.

But developer tooling for non-NVIDIA chipsets is lacking. We felt this pain ourselves. We initially tried using PyTorch XLA to train Llama 3.1 on TPUs, but it was rough: the XLA integration with PyTorch is clunky, libraries are missing (bitsandbytes didn't work), and the HuggingFace errors are cryptic.

We then took a different route and translated Llama 3.1 from PyTorch to JAX. Now it’s running smoothly on TPUs! We still have challenges ahead (for one, there is no good LoRA library in JAX), but this feels like the right path forward.

Here's a demo (https://dub.sh/felafax-demo) of our managed solution.

Would love your thoughts on our repo and vision as we keep chugging along!

13 comments

nl 8 months ago

I'm pretty sure anyone finetuning Llama on a regular basis now is using https://github.com/unslothai/unsloth, so comparisons should be against that. The open source version is ~2x faster than default implementations. It's NVIDIA-only, although the kernels are written in Triton, so they might be portable.

reissbaker 8 months ago

Very cool! Unlocking TPU training is a big win.

FWIW, if this helps prioritize: personally I'd find LoRA training for Llama 3.1 most useful (which it sounds like currently isn't well-supported with Felafax?), since with something like vLLM you can serve large numbers of LoRAs that share the same underlying GPU resources (assuming they're based on the same base model), vs. full finetunes, where each model needs to be deployed on its own set of GPUs.

In general I would guess that full finetunes are going to be less cost-effective for most enterprise use cases: finetuning, whether full finetuning or PEFT, generally improves only task-specific performance, so assuming you've got more than one task you want to use a model for in your business, it'll pretty quickly become dramatically cheaper to do the tasks with LoRAs rather than full finetunes, unless you're saturating the boxes for each specific task. So I'm hoping you guys build support for LoRA training with JAX in addition to full finetuning!
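To make the LoRA vs. full-finetune argument concrete, here is a minimal back-of-the-envelope sketch in plain Python. It assumes the standard LoRA formulation (a frozen weight matrix plus a trainable low-rank update built from two small matrices). The layer size, rank, adapter size, and task count are illustrative placeholders, not numbers from the thread or from Felafax, and the function names are hypothetical.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole (d_out x d_in) matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights for a LoRA update B @ A, with A: (rank x d_in), B: (d_out x rank)."""
    return rank * (d_in + d_out)


def weights_to_serve(num_tasks, base_params, adapter_params, use_lora):
    """Total weights held in accelerator memory to serve num_tasks task-specific models.

    With LoRA, one shared base model plus one small adapter per task.
    With full finetunes, one complete copy of the model per task.
    """
    if use_lora:
        return base_params + num_tasks * adapter_params
    return num_tasks * base_params


# Illustrative numbers only (assumptions, not measurements).
layer_full = full_finetune_params(4096, 4096)
layer_lora = lora_params(4096, 4096, rank=8)
print(f"per-layer trainable weights: full={layer_full:,} lora={layer_lora:,}")

shared = weights_to_serve(10, 8_000_000_000, 20_000_000, use_lora=True)
separate = weights_to_serve(10, 8_000_000_000, 20_000_000, use_lora=False)
print(f"weights to serve 10 tasks: lora={shared:,} full={separate:,}")
```

The sketch counts weights only. It ignores activation memory, KV cache, and per-request batching overhead, all of which affect real serving cost.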

axpy906 8 months ago

I am actually not surprised by JAX converting better to XLA. Also, deep respect for anybody in this space, as there is a lot of complexity to deal with at the framework and compiler level.

fbn79 8 months ago

I'm totally new to AI. If I take, for example, Llama 3.1 (the small 8B size), what's the rough budget to fine-tune it on, say, 1GB of extra text data on any cloud GPU service? (Compute time is not a problem; I can wait.)
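One hedged way to frame a question like this is a rough compute estimate. The sketch below uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token (forward plus backward pass). Everything else is an assumption to replace with your own values: roughly 4 bytes of English text per token, the peak throughput of the chip, the fraction of that peak actually achieved, and the hourly price. The function name and the numbers in the example call are hypothetical placeholders, not quotes from any cloud provider.

```python
def estimate_finetune_budget(dataset_bytes, n_params, bytes_per_token, epochs,
                             peak_flops_per_chip, utilization, price_per_chip_hour):
    """Rough compute-only budget for full fine-tuning of a dense transformer.

    Uses the ~6 * parameters * tokens rule of thumb for training FLOPs.
    Ignores memory limits, data loading, checkpointing, and failed runs.
    """
    tokens = dataset_bytes / bytes_per_token * epochs
    total_flops = 6 * n_params * tokens
    # Sustained throughput is only a fraction of the advertised peak.
    effective_flops_per_chip = peak_flops_per_chip * utilization
    chip_hours = total_flops / effective_flops_per_chip / 3600
    return {
        "tokens": tokens,
        "total_flops": total_flops,
        "chip_hours": chip_hours,
        "cost": chip_hours * price_per_chip_hour,
    }


# Placeholder inputs: 1 GB of text, an 8B-parameter model, one epoch.
# The hardware figures below are round hypothetical values, not real specs or prices.
estimate = estimate_finetune_budget(
    dataset_bytes=1e9,
    n_params=8e9,
    bytes_per_token=4.0,
    epochs=1,
    peak_flops_per_chip=1e15,
    utilization=0.4,
    price_per_chip_hour=2.0,
)
print(f"~{estimate['tokens']:.3g} tokens, ~{estimate['chip_hours']:.1f} chip-hours, "
      f"~${estimate['cost']:.2f} at the assumed price")
```

This covers compute only. Full fine-tuning of an 8B model also needs enough accelerator memory for weights, gradients, and optimizer state, which often forces a multi-chip setup regardless of how long you are willing to wait; parameter-efficient methods such as LoRA reduce that requirement.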

mandoline 8 months ago

Do you have any apples-to-apples speed and cost comparisons between NVIDIA and non-NVIDIA chips (as you mentioned: TPUs, Trainium, AMD GPUs)?

Palmik 8 months ago

> For example, training and serving Llama 3.1 on Google TPUs is about 30% cheaper than NVIDIA GPUs

When you say this, you should specify which NVIDIA GPU you mean (I assume H100 SXM) and what price you are assuming for that GPU.

One can't simply compare based on the on-demand price on GCP, because the NVIDIA GPUs there are extremely overpriced.
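A "30% cheaper" style claim only means something once both the hourly price and the measured throughput are pinned down for each accelerator. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the prices and throughputs in the example are made-up placeholders chosen only to show the calculation, not benchmarks of any real TPU or GPU.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second):
    """Dollar cost to process one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


def relative_savings(baseline_cost, candidate_cost):
    """Fraction by which the candidate is cheaper than the baseline (negative if pricier)."""
    return (baseline_cost - candidate_cost) / baseline_cost


# Made-up inputs: accelerator B has a 40% lower hourly price than A,
# but also lower throughput, so the per-token saving is smaller.
cost_a = cost_per_million_tokens(price_per_hour=10.0, tokens_per_second=1000.0)
cost_b = cost_per_million_tokens(price_per_hour=6.0, tokens_per_second=800.0)
print(f"A: ${cost_a:.2f}/M tokens, B: ${cost_b:.2f}/M tokens, "
      f"savings: {relative_savings(cost_a, cost_b):.0%}")
```

Swapping the on-demand price for a reserved or third-party price changes the result directly, which is why the assumed price needs to be stated alongside the percentage.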

htrp 8 months ago

What was the estimate for how much time you guys took to translate the PyTorch code to JAX vs. how much you spent on XLA?

xrd 8 months ago

Anyone want to comment on this versus the fine-tuning speedups for Llama 3.1 with Unsloth?

tcdent 8 months ago

Where in the codebase is the logic specific to TPU vs. CUDA?

ricw 8 months ago

I’m surprised it’s only 30% cheaper than NVIDIA. How come? This seems to indicate that the NVIDIA premium isn’t as high as everybody makes it out to be.

khimaros 8 months ago

An interesting thread with speculation about how to eventually do this on local TPUs with llama.cpp and GGUF infrastructure: https://www.reddit.com/r/LocalLLaMA/comments/12o96hf/has_anyone_used_llama_with_a_tpu_instead_of_gpu/?sort=new

faangguyindia 8 months ago

For 99% of cases, flash is enough. Period.

stroupwaffle 8 months ago

You might want to change the Road Runner logo, because it’s definitely copyrighted.

Show HN: Tune LLaMa3.1 on Google Cloud TPUs
