LOL, I was reading the abstract and remembered there used to be a paper like that. Then I looked at the title and saw it was from 2020. For a moment I thought someone had plagiarised the original paper.

Unfortunately, BERT models are dead. Even the cross between BERT and GPT, the T5 architecture (encoder-decoder), is rarely used.

The issue with BERT is that you need to modify the network to adapt it to each task by adding a prediction head, while decoder models (GPT style) do every task with tokens and never need to modify the network. Their advantage is a single format for everything. BERT's advantage is bidirectional attention, but apparently large decoder models don't have much of an issue with unidirectionality.
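To make the contrast concrete, here is a minimal sketch using the Hugging Face transformers library (the model names, the two-label setup, and the prompt are illustrative placeholders, not anything from the paper):

    from transformers import (
        AutoModelForCausalLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )

    # Encoder route: a task-specific classification head (randomly
    # initialized) is bolted onto the pretrained encoder and must be
    # fine-tuned separately for each task.
    bert = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # new head with 2 output labels
    )

    # Decoder route: the task is expressed entirely as tokens; the
    # network itself is never modified.
    tok = AutoTokenizer.from_pretrained("gpt2")
    gpt = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = "Review: 'Great movie!'\nSentiment (positive/negative):"
    inputs = tok(prompt, return_tensors="pt")
    out = gpt.generate(**inputs, max_new_tokens=3)
    print(tok.decode(out[0], skip_special_tokens=True))

The same decoder and the same generate() call handle classification, QA, summarization, and so on, just by changing the prompt, which is exactly the single-format advantage described above.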
Good work by well-known, reputable authors.

The gains in training efficiency and the reduction in compute cost versus widely used text-encoding models like RoBERTa and XLNet are significant.

Thank you for sharing this on HN!
This reminds me of a somewhat parallel finding from classic expert systems: human experts shine at discrimination, and that is one of the most efficient methods of eliciting knowledge from them.