科技回声

7 comments

spxneo, about 1 year ago

This is quite worrying for OpenAI given the rate at which token prices have been plummeting thanks to Meta. It's going to have to keep cutting its prices while capex remains flat. Whatever Sam says in interviews, just think the opposite and the whole picture comes together.

It's almost a mathematical certainty that people who invested in OpenAI will need to reincarnate in multiple universes to ever see that money again, but no bother: many are probably NVIDIA stockholders too, which evens out the damage.

modeless, about 1 year ago

I don't need exact results. FP8 quantization is almost lossless and even 6-bit quantization is usually acceptable. Can this be combined with quantization?

freeqaz, about 1 year ago

So this is 8x faster for serving these models than before? Or is this about it being more deterministic? I can't quite tell from reading it.

aussieguy1234, about 1 year ago

I'm looking at buying 2x RTX 3060s to run Llama 70B on the new PC I just purchased.

Will this work, or do I need a Tesla P40 or two?

thelittleone, about 1 year ago

Other than portability and privacy, are there any benefits to running a local model with a 4090, versus running the same model on-demand on a cloud service with the same or more powerful card?

zwaps, about 1 year ago

Is it me, or is this paper basically missing all technical information?

I get that there's proprietary technology, but if so, can we please not put this on arXiv and pretend it's a scientific contribution?

halyconWays, about 1 year ago

Someone get this into koboldcpp

SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency
