We have a reading club every Friday where we go over the fundamentals of a lot of the state-of-the-art techniques used in Machine Learning today. Last week we dove into the "Vision Transformers" (ViT) paper from 2021, where the Google Brain team benchmarked training large-scale transformers against ResNets.

Though it is not groundbreaking research as of this week, I think with the pace of AI it is important to dive deep into past work and what others have tried! It's nice to take a step back and learn the fundamentals as well as keep up with the latest and greatest.

Posted the notes and recap here if anyone finds it helpful:

https://blog.oxen.ai/arxiv-dives-vision-transformers-vit/

Also, we would love to have anyone join us live on Fridays! We've got a pretty consistent and fun group of 300+ engineers and researchers showing up.
I wonder if overlapping the patches would improve accuracy further, as a way to kind of anti-alias the data learned/inferred. In other words, if patch 0 covers (0,0)-(16,16) and patch 1 covers (16,0)-(32,16), we could instead use (12,0)-(28,16) for patch 1 so it overlaps the previous patch by 4 pixels. You'd have more patches, so it would be more expensive compute-wise, but it might reduce any artificial aliasing that the hard patch boundaries create during both training and inference.
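A minimal sketch of that idea, assuming PyTorch and a ViT-Base-style embedding dimension of 768 (both my assumptions, not from the thread): ViT's patch embedding is just a conv with kernel size equal to stride, so overlapping patches amounts to shrinking the stride below the kernel size, at the cost of more tokens.

```python
import torch
import torch.nn as nn

# Standard ViT patch embedding: non-overlapping 16x16 patches.
# A Conv2d with kernel_size == stride cuts the image into patches and
# projects each one to the embedding dimension in a single op.
non_overlapping = nn.Conv2d(in_channels=3, out_channels=768,
                            kernel_size=16, stride=16)

# Overlapping variant as described above: keep the 16x16 patch size
# but slide by 12 pixels, so consecutive patches share a 4-pixel strip.
overlapping = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=12)

x = torch.randn(1, 3, 224, 224)  # dummy 224x224 RGB image

# Flatten the spatial grid into a token sequence: (B, C, H, W) -> (B, N, C)
tokens_plain = non_overlapping(x).flatten(2).transpose(1, 2)
tokens_overlap = overlapping(x).flatten(2).transpose(1, 2)

print(tokens_plain.shape)    # torch.Size([1, 196, 768])  -> 14x14 patches
print(tokens_overlap.shape)  # torch.Size([1, 324, 768])  -> 18x18 patches
```

So for a 224x224 image the sequence grows from 196 to 324 tokens, which is where the extra compute goes (attention cost scales quadratically with token count). If I recall correctly, later hierarchical ViT variants such as PVTv2 use overlapping patch embeddings along these lines.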