Hey! I'm an AI engineer, and I'm currently trying to set up a GPU endpoint to run inference on the GTE embeddings model.
Currently our price per 1k tokens is exactly the same as OpenAI's ada-002.

I set up ONNX Runtime inference on runpod.io, so we pay per second of GPU time.
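For context, here is a minimal sketch of the kind of setup I mean, assuming the model has already been exported to ONNX (e.g. thenlper/gte-base via optimum); the local model path and pooling details are illustrative, not my exact production code:

  import numpy as np
  import onnxruntime as ort
  from transformers import AutoTokenizer

  MODEL_DIR = "gte-base-onnx"  # assumed local ONNX export directory

  tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
  session = ort.InferenceSession(
      f"{MODEL_DIR}/model.onnx",
      providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
  )
  input_names = {i.name for i in session.get_inputs()}

  def embed(texts):
      # Tokenize a batch and run it through the ONNX graph on GPU.
      enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
      outputs = session.run(None, {k: v for k, v in enc.items() if k in input_names})
      last_hidden = outputs[0]                 # (batch, seq, hidden)
      mask = enc["attention_mask"][..., None]  # (batch, seq, 1)
      # Mean-pool over non-padding tokens, then L2-normalize (GTE convention).
      pooled = (last_hidden * mask).sum(1) / mask.sum(1)
      return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

  print(embed(["hello world"]).shape)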
I know it is theoretically possible to cut the cost much further, but I'm limited by the number of experiments I can run.

I wonder if there is anyone who could help me figure out the low-level NVIDIA GPU optimisation side of things?

Please leave a DM here if you feel you have the expertise and can help!
https://x.com/karmedge