I've been evaluating running non-quantized models on Google Cloud instances with various GPUs.<p>To run a `vllm`-backed Llama 2 7B model[1], start a Debian 11 <i>spot</i> instance: a g2-standard-8 with one Nvidia L4 GPU and 100GB of SSD disk (ignoring the advice to use a CUDA installer image):<p><pre><code> sudo apt-get update -y
sudo apt-get install build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run # ~5 minutes, install defaults, type 'accept'/return
sudo apt-get install python3-pip -y
sudo pip install --upgrade huggingface_hub
# skip using token as git credential
huggingface-cli login  # paste an HF token[2] that has access to the Meta models
sudo pip install vllm # ~8 minutes
</code></pre>
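Before pulling the weights, it's worth a quick sanity check that the driver install actually took (torch comes in as a vllm dependency; the file name here is just my choice):<p><pre><code> # check_gpu.py
import torch

print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # expect: NVIDIA L4
</code></pre>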
Then create a short test script for the 7B Llama 2 model and save it as llama.py:<p><pre><code> from vllm import LLM
llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("The capital of Brazil is called")
print(output)  # prints a list of RequestOutput objects
</code></pre>
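A note on the output: generate() returns a list of RequestOutput objects, so the print above dumps their raw repr. If you want just the completion text (plus some control over sampling), a variation along these lines should work, going by vllm's documented API (the sampling values here are arbitrary):<p><pre><code> from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# one RequestOutput per prompt; the generated text lives in .outputs
for out in llm.generate(["The capital of Brazil is called"], params):
    print(out.prompt, out.outputs[0].text)
</code></pre>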
Spot pricing for this deployment runs ~$225/month. The instance will eventually be preempted by Google, so plan accordingly (one way to get a heads-up is sketched after the links).<p>[1] <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html" rel="nofollow noreferrer">https://vllm.readthedocs.io/en/latest/models/supported_model...</a>
[2] <a href="https://huggingface.co/settings/tokens" rel="nofollow noreferrer">https://huggingface.co/settings/tokens</a>
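On the preemption caveat: Spot VMs expose a documented metadata flag that flips to TRUE when Google starts reclaiming the instance, so a small watcher (a sketch only; the poll interval and what you do on preemption are up to you) can give your job a chance to flush state:<p><pre><code> # watch_preempt.py
import time
import urllib.request

URL = ("http://metadata.google.internal/computeMetadata/v1/"
       "instance/preempted")

def preempted():
    req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode().strip() == "TRUE"

while not preempted():
    time.sleep(30)  # Google gives roughly 30s of notice
print("preemption signaled -- checkpoint now")
</code></pre>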