Transformers Are All You Need

13 points by takiwatanga about 3 years ago

3 comments

riotnrrd about 3 years ago
The performance advantages of transformer architectures have little to do with "transformers" themselves, but rather with other architecture changes that were made along the way. They can be matched or beaten in performance by tweaked conventional CNNs: https://arxiv.org/abs/2201.03545
[Comment #31021543 not loaded]
mrintellectual about 3 years ago
> ViT models are outperforming CNNs in terms of computational efficiency and accuracy, achieving highly competitive performance in tasks like image classification, object detection, and semantic image segmentation.

Since then, this has been shown to be untrue. Using more modern training techniques along with depthwise convolutions (https://arxiv.org/abs/2201.03545) results in equal if not better performance on vision tasks. Improved training methodologies have also been shown to boost the accuracy of ResNet50, a 6-year-old pure convolutional architecture, on ImageNet-1k by over 5% (https://arxiv.org/abs/2110.00476).

Pure ViTs are also more difficult to train than traditional convnets, although this has since been somewhat remedied by Swin (https://arxiv.org/abs/2103.14030).
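
For readers unfamiliar with the depthwise convolutions mentioned above, here is a minimal sketch (assuming PyTorch) of a ConvNeXt-style block in the spirit of https://arxiv.org/abs/2201.03545. It is illustrative only and omits details of the actual paper's block (LayerNorm, layer scale, stochastic depth): a large-kernel depthwise convolution mixes information spatially within each channel, and 1x1 pointwise convolutions mix information across channels.

import torch
import torch.nn as nn

class DepthwiseConvBlock(nn.Module):
    # Illustrative ConvNeXt-style block, not the paper's exact design.
    def __init__(self, dim):
        super().__init__()
        # Depthwise conv: groups == channels, so each channel is filtered
        # independently over space (roughly analogous to attention's token mixing).
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Pointwise (1x1) convs act like a per-position MLP across channels.
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        return x + self.pwconv2(self.act(self.pwconv1(self.dwconv(x))))

x = torch.randn(1, 96, 56, 56)          # (batch, channels, height, width)
print(DepthwiseConvBlock(96)(x).shape)  # torch.Size([1, 96, 56, 56])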
beernet about 3 years ago
> Unlike BERT, GPT models are unidirectional, and their main advantage is the magnitude of data they were pretrained on: GPT-3, the third-generation GPT model, was trained on 175 billion parameters, about 10 times the size of previous models.

Although the number of parameters of a Transformer correlates with the amount of training data, this statement is misleading. To be precise, the model wasn't trained on its parameters; rather, its parameters were trained on data.
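
To make the distinction concrete, a minimal sketch (assuming PyTorch; the layer sizes are arbitrary and not GPT-3's): the parameter count is fixed by the architecture, while the training data is a separate quantity that the parameters are fit to.

import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# Parameter count is a property of the architecture itself.
num_params = sum(p.numel() for p in model.parameters())
print(f"model parameters: {num_params:,}")

# Training data size is a different number entirely; the parameters are fit to it.
# (GPT-3's 175B parameters were reportedly trained on roughly 300B tokens of text.)
training_tokens = 300_000_000_000
print(f"training tokens: {training_tokens:,}")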