科技回声

5 comments

juancn, 3 months ago

The compute scheduling part of the paper is also very good: the way they balanced load to keep compute and communication in check.

There is also a lot of thought put into all the tiny bits of optimization to reduce memory usage, using FP8 effectively without significant loss of precision or dynamic range.

None of the techniques by themselves are really mind-blowing, but the whole of it is very well done.

The DeepSeek-V3 paper is really a good read: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
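
To make the point about dynamic range concrete: one general way to keep small values alive at 8 bits is to scale small blocks of numbers independently instead of using one scale for the whole tensor. The sketch below is a toy illustration of that idea, not DeepSeek's implementation. `fake_fp8` is a hypothetical helper that only roughly imitates the E4M3 format (4 significant bits, maximum 448, smallest step 2^-9), and the input values and block size of 4 are made up.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def fake_fp8(x: float) -> float:
    """Rough E4M3 simulation (assumption: no NaN/inf handling)."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))      # saturate instead of overflowing
    if abs(x) < 2.0 ** -6:                    # subnormal range: fixed step 2**-9
        step = 2.0 ** -9
        return round(x / step) * step
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)  # keep 4 significant bits


def quantize(values, block):
    """Quantize and dequantize, using one scale per `block` consecutive values."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = E4M3_MAX / amax               # map the block's max onto 448
        out.extend(fake_fp8(v * scale) / scale for v in chunk)
    return out


# Made-up data: 15 small values and one large outlier.
values = [1e-4 * (1 + i % 7) for i in range(12)] + [1000.0, 2e-4, 3e-4, 4e-4]

per_tensor = quantize(values, block=len(values))  # one scale for everything
per_block = quantize(values, block=4)             # one scale per 4 values

print("flushed to zero, single scale:", sum(q == 0.0 for q in per_tensor))
print("flushed to zero, per-block   :", sum(q == 0.0 for q in per_block))
```

With these inputs, the single global scale flushes all 15 small values to zero, while per-block scaling loses only the 3 values that share a block with the outlier.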

ilaksh, 3 months ago

Why is it that the larger models are better at understanding and following more and more complex instructions, and generally just smarter?

With DeepSeek we can now run on non-GPU servers with a lot of RAM. But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?

I guess what I am sort of thinking of is a model that comes with its own built-in vector DB and search as part of every inference cycle, or something like that.

But I know that there is something about the larger models that is required for really intelligent responses. Or at least that is how it seems, because smaller models are just not as smart.

If we could figure out how to change it so that you would rarely need to update the background knowledge during inference, and most of it could live on disk, that would make this dramatically more economical.

Maybe a model could have retrieval built in and be trained to reduce the number of retrievals the longer the context is. Or something.
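
Some rough arithmetic helps frame the "671 GB" figure. The DeepSeek-V3 paper reports 671B total parameters, of which 37B are activated per token, so the mixture-of-experts design already touches only a small slice of the weights for any one token. The sketch below assumes weights only (no KV cache or activations) and treats the bytes-per-parameter values as illustrative; `weight_gb` is a hypothetical helper.

```python
TOTAL_PARAMS = 671e9   # total parameters reported in the DeepSeek-V3 paper
ACTIVE_PARAMS = 37e9   # parameters activated per token, per the same paper


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in GB (10**9 bytes); ignores KV cache and activations."""
    return params * bytes_per_param / 1e9


for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    total = weight_gb(TOTAL_PARAMS, bytes_per_param)
    active = weight_gb(ACTIVE_PARAMS, bytes_per_param)
    print(f"{name}: all weights ~{total:.0f} GB, touched per token ~{active:.0f} GB")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"fraction of weights used per token: {active_fraction:.1%}")
```

Which experts are active changes from token to token, so all the weights still have to be reachable quickly. That is why the per-token fraction alone does not let most of the model live on disk.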

whimsicalism, 3 months ago

none of these techniques except MLA are new

doener, 3 months ago

I hate it so much that HN automatically removes some words in headlines, like „how.“ For a while after posting, though, you can add them back by editing the headline.

1970-01-01, 3 months ago

Has DeepSeek challenged the very weird hallucination problem? Reducing hallucinations now seems to be the remaining fundamental issue that needs scientific research. Everything else feels like an engineering problem.

How has DeepSeek improved the Transformer architecture?
