Hey! I'm an AI engineer, and I'm currently trying to set up a GPU endpoint to run inference on the GTE embeddings model.
Currently our price per 1k tokens is exactly the same as OpenAI's ada-002.

I set up ONNX Runtime inference on runpod.io, so we pay per second of GPU time.
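For context, here is a minimal sketch of the kind of setup I mean, assuming the model has already been exported to ONNX (e.g. thenlper/gte-base via optimum); the local model path and pooling details are illustrative, not my exact production code:

  import numpy as np
  import onnxruntime as ort
  from transformers import AutoTokenizer

  MODEL_DIR = "gte-base-onnx"  # assumed local ONNX export directory

  tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
  session = ort.InferenceSession(
      f"{MODEL_DIR}/model.onnx",
      providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
  )
  input_names = {i.name for i in session.get_inputs()}

  def embed(texts):
      # Tokenize a batch and run it through the ONNX graph on GPU.
      enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
      outputs = session.run(None, {k: v for k, v in enc.items() if k in input_names})
      last_hidden = outputs[0]                 # (batch, seq, hidden)
      mask = enc["attention_mask"][..., None]  # (batch, seq, 1)
      # Mean-pool over non-padding tokens, then L2-normalize (GTE convention).
      pooled = (last_hidden * mask).sum(1) / mask.sum(1)
      return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

  print(embed(["hello world"]).shape)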
I know it is theoretically possible to cut the cost much further, but I'm limited by the number of experiments I can run.

I wonder if there is anyone who could help me figure out the low-level NVIDIA GPU optimisation side of things?

Please leave a DM here if you feel you have the expertise and can help!
https://x.com/karmedge