
How fast can one reasonably expect to get inference on a ~70B model?

9 points by yungtriggz 12 months ago
I've been playing around with deploying different large models on various platforms (HF, AWS, etc.) for testing and have been underwhelmed by the inference speeds I've been able to achieve. They're fine (though considerably slower than OpenAI) but nothing like what I feel I've been led to believe by others who talk about how frighteningly fast their self-hosted models are.

For reference, I get responses in:

~1200ms from gpt-3.5-turbo
~1600ms from gpt-4o
~5000ms from llama-70b-instruct on a dedicated HF endpoint

I've been using standard Nvidia A100, 4x GPU, 320 GB for these deployments, so I'm now wondering: am I missing something, or were my expectations just unreasonable? Curious to hear any of your thoughts, experiences, and tips/tricks, thanks.
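One caveat worth noting before comparing providers: raw response time depends heavily on how many tokens each model actually generates, so tokens per second (and time to first token) is usually the fairer metric. A minimal sketch of measuring that, assuming a hypothetical `call_model` callable that wraps whichever client is in use (OpenAI SDK, HF endpoint, etc.):

```python
import time

def measure(call_model, prompt, n_runs=5):
    """Time a model call and report per-token throughput.

    `call_model` is a hypothetical callable that sends `prompt` to an
    endpoint and returns (generated_text, num_output_tokens); swap in
    whatever client you are actually benchmarking.
    """
    latencies, tokens = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        _, n_tokens = call_model(prompt)
        latencies.append(time.perf_counter() - start)
        tokens.append(n_tokens)

    total_s = sum(latencies)
    print(f"mean latency: {1000 * total_s / n_runs:.0f} ms")
    print(f"throughput:   {sum(tokens) / total_s:.1f} tokens/s")
```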

4 comments

mks_shuffle 12 months ago
You can try the Groq API for faster inference. They use custom hardware to speed up inference. Supported open models can be found here: https://console.groq.com/docs/models (includes llama-70b)
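A minimal sketch of calling a 70B model through Groq's OpenAI-style chat API, assuming the `groq` Python SDK; the exact model identifier used below is an assumption and should be taken from the models page linked above:

```python
from groq import Groq

client = Groq(api_key="YOUR_GROQ_API_KEY")

response = client.chat.completions.create(
    # Model name is a placeholder; check the Groq docs for current 70B models.
    model="llama3-70b-8192",
    messages=[{"role": "user", "content": "Summarize the transformer architecture in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```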
agr_nyc 12 months ago
We are getting a forward-pass time of ~100ms on Meta's original Llama 2 70B (float16, batch size 8) PyTorch implementation on 8xA100. Those results are very underwhelming in terms of fully utilizing the GPU FLOPS. If we are doing something wrong, let me know.

The vLLM implementation is much faster, I think 50ms or better on either 4 or 8 A100s; I forget the exact number.
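For reference, a minimal vLLM sketch along those lines, assuming 4 A100s with tensor parallelism; the model id and sampling settings are placeholders, not the commenter's exact setup:

```python
from vllm import LLM, SamplingParams

# Shard the 70B model across 4 GPUs; model id is an assumption, use whichever
# Llama 70B checkpoint you have access to.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    dtype="float16",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in two sentences."], params)
print(outputs[0].outputs[0].text)
```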
kkielhofner 12 months ago
TensorRT-LLM with Triton Inference Server is the fastest in Nvidia land.

https://github.com/triton-inference-server/tensorrtllm_backend
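Roughly what querying that stack looks like once a TensorRT-LLM engine is built and served, sketched with plain `requests` against Triton's HTTP generate endpoint; the model name (`ensemble`) and field names are assumptions based on the examples in the linked repo and may differ in your configuration:

```python
import requests

# Assumes a Triton server running the tensorrtllm_backend ensemble model
# locally on port 8000.
payload = {
    "text_input": "What is the capital of France?",
    "max_tokens": 64,
    "temperature": 0.0,
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])
```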
uptownfunk 12 months ago
Dumb q: have you profiled the inference execution? Where are the bottlenecks you are observing?
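A minimal sketch of what that profiling could look like with torch.profiler, shown on a small model purely so the snippet is self-contained; the same wrapper applies to a 70B generation step on a CUDA machine:

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny model just to keep the sketch self-contained; swap in your 70B setup.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tok("The quick brown fox", return_tensors="pt")

# Assumes a CUDA-capable machine; drop ProfilerActivity.CUDA for CPU-only runs.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32)

# Sort by GPU time to spot the dominant ops/kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```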