Can't find "apple to apple" comparison on performance on QWQ 32b (4bit), can anyone help me with decision on which solution to pick?<p>From what I dig so far it looks like dual Arc A770 is supported by llama.cpp. And saw some reports that llama.cpp on top of IPEX-LLM is fastest way for inference on intel card.<p>On the other end there is more expensive 7900 XTX on which AMD claims (Jan '25) that inference is faster than on 4090.<p>So - what is the state of the art as of today, how does one compare to another (apple to apple)? What is tokens/s diff?
I don't know, but you'll probably find a better answer here:

https://www.reddit.com/r/LocalLLaMA/

Using the search gave me a bunch of threads, but here's one:

https://www.reddit.com/r/LocalLLaMA/comments/1ip6c9e/looking_to_buy_two_arc_a770_16gb_for_llm/