A less-hyped inference engine with INT8/FP16 inference support on both CPU and GPU (CUDA).<p>Supported models:
GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, LLAMA, T5, WHISPER<p>(I found this library while researching alternatives to Triton/FasterTransformer for Tabby: <a href="https://github.com/TabbyML/tabby">https://github.com/TabbyML/tabby</a>)