Llama 2 chat with vLLM and tensor parallel guide

1 point by paulcjh over 1 year ago

1 comment

paulcjh over 1 year ago
Hope that you enjoy the guide. Below are also some cost/speed comparisons for running the models with vLLM:

- 7B, 1x A100, 25GB VRAM, 49 tok/s, $0.0113 /1k tok
- 13B, 1x A100, 37GB VRAM, 32 tok/s, $0.0174 /1k tok
- 70B, 2x A100, 150GB VRAM, 13 tok/s, $0.128 /1k tok
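
For anyone skimming, here is a minimal sketch of what serving the 70B chat model with tensor parallelism in vLLM looks like; the model ID, prompt, and sampling settings are illustrative assumptions, not taken from the guide. Setting tensor_parallel_size=2 shards the weights across two GPUs, matching the 2x A100 row above.

```python
# Minimal vLLM sketch (illustrative, not from the linked guide):
# Llama 2 70B chat sharded across 2 GPUs with tensor parallelism.
# Assumes vLLM is installed and two A100s are visible to the process.
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits each layer's weights across 2 GPUs;
# it must evenly divide the model's number of attention heads.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed HF model ID
    tensor_parallel_size=2,
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# Llama 2 chat models expect [INST] ... [/INST] around the user turn.
outputs = llm.generate(["[INST] Explain tensor parallelism briefly. [/INST]"], sampling)
print(outputs[0].outputs[0].text)
```

For the single-GPU 7B and 13B rows, the same call works with tensor_parallel_size=1 (the default).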