Yi Tay's response (chief scientist at Reka AI, ex-Google Brain researcher): https://twitter.com/YiTayML/status/1783273130087289021

> not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you'll feed it you'll still be lacking behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.

> on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level.

> the truth is the community stands on the shoulder of giants of all the arch research that have been done to push the transformer to this state today.

> architecture research matters. many people just take it for granted these days.
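A minimal PyTorch sketch (my own illustration, not from the tweet) of the "tokens cannot even see each other" point: a position-wise MLP maps each token independently, while self-attention mixes information across positions, so perturbing one token changes only that position's output under the MLP but changes every position under attention.

    import torch
    import torch.nn as nn

    d = 16
    x = torch.randn(1, 5, d)                      # batch of 1, 5 tokens, d-dim embeddings

    mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
    attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    y_mlp = mlp(x)                                # applied to each position independently
    y_attn, _ = attn(x, x, x)                     # each output attends over all positions

    x2 = x.clone()
    x2[0, 0] += 1.0                               # perturb only token 0

    print((mlp(x2) - y_mlp).abs().sum(-1))              # nonzero only at position 0
    print((attn(x2, x2, x2)[0] - y_attn).abs().sum(-1)) # nonzero at every position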
I took the Andrew Ng Coursera machine learning course in 2015, and to this day I still remember him saying this in one of the videos. At the time he was talking about various versions/optimizations of gradient descent, but he essentially said that tweaking the algorithm will only make your model ~1% better, while doubling the amount of training data will have a substantially larger impact (use any old algorithm, just throw more data at the problem). That's why it was already evident back then that Google, Facebook, etc. were sitting on a goldmine: in the long run, those with the most data, not the brightest PhDs, will win this race.
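For what it's worth, the effect is easy to see even on toy data: hold the model family fixed, double the training set a few times, and swap the fitting algorithm at each size. A rough scikit-learn sketch (synthetic data, numbers purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, n_features=50, n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    for n in (1000, 2000, 4000, 8000):            # "double the data" each step
        # same linear model, two different optimization procedures
        for model in (LogisticRegression(max_iter=1000), SGDClassifier(random_state=0)):
            model.fit(X_tr[:n], y_tr[:n])
            print(n, type(model).__name__, round(model.score(X_te, y_te), 3))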
As a hobbyist who has trained models for different use cases ranging from object detection and recognition to text completion to image generation, the best advice I've seen has consistently been to curate and annotate your dataset as perfectly as you can before worrying about anything else.

A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and such can be fiddled with once you know you are on the right track, and in the scheme of things they are relatively minor for most purposes.

Of course, this advice gets routinely ignored: people spend countless hours fussing over how to set certain flags and grabbing as much data as possible, then carelessly throw it all together and train on it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.

It is a giant pain in the ass, but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
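To make that concrete, a hypothetical curation pass might look like the sketch below (the annotations.csv file and its image_path/label columns are made up for illustration): dedupe, drop broken rows, and surface conflicting labels for manual review before touching a single hyperparameter.

    import csv
    from collections import defaultdict

    with open("annotations.csv") as f:            # assumed schema: image_path, label
        rows = list(csv.DictReader(f))

    # 1. Drop exact duplicates and rows with missing fields.
    seen, clean = set(), []
    for r in rows:
        key = (r["image_path"], r["label"])
        if key in seen or not r["image_path"] or not r["label"]:
            continue
        seen.add(key)
        clean.append(r)

    # 2. Flag images annotated with more than one label for human review.
    labels = defaultdict(set)
    for r in clean:
        labels[r["image_path"]].add(r["label"])
    conflicts = {p: ls for p, ls in labels.items() if len(ls) > 1}

    print(len(rows) - len(clean), "duplicate/empty rows dropped,",
          len(conflicts), "label conflicts to review")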
This makes me sad, not because I disagree with it, but because it's basically common wisdom in the statistical and ML communities (of practitioners). In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.

That being said, if you use a linear model (like lasso) vs. a tree-based model (like XGBoost), you'll definitely see differences, but once you have a flexible enough model and a *lot* of data, training time and inference complexity tend to become better ways to make a model choice.
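A quick sketch of that contrast on a standard nonlinear benchmark, using scikit-learn's gradient boosting as a stand-in for XGBoost (the scores are illustrative, not a claim about any particular dataset):

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import cross_val_score

    X, y = make_friedman1(n_samples=5000, noise=1.0, random_state=0)

    for model in (LassoCV(), GradientBoostingRegressor(random_state=0)):
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(type(model).__name__, round(r2, 3))  # the tree ensemble captures the nonlinearity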
I don’t get this: “What that means is not only that they learn what it means to be a dog or a cat, …“

We don’t have any dataset of dog or cat experience, right? OP probably means that the model learns what a dog or cat is, right?

I find the whole piece somewhat vague, btw. No real insights if you ask me. Sure, if all you put in is a dataset, that should be all you get out. What’s surprising (worth HN) here?
> It is a giant pain in the ass, but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.

I read this in another comment. It sounds very much like a curation thing, and now I'm wondering: isn't this part already covered by the many human beings now interacting with ChatGPT and the like?

My uneducated guess is that a company can scrape the whole world wide web, low-quality content and all, and then strengthen/curate their data and/or model by having it interact with humans. You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct it by 'chatting' with it?
Has anyone tried removing an entire concept from a dataset and seeing if the LLM can reason its way into the concept?

I think that would be a really cool experiment.

There are probably some really good candidate concepts that just take a small leap of reasoning to reach. Off the top of my head: maybe multiplication? Or the concept of zero. Maybe the wheel?

Edit: if anyone is interested in doing this kind of stuff, hit me up (email in profile). I want to start doing these kinds of things as a side project.
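A rough sketch of what the filtering step might look like, using multiplication as the example concept (the regex patterns are just a starting point, not a complete scrub); the interesting part is then probing the trained model to see whether it rediscovers the operation from addition alone.

    import re

    CONCEPT_PATTERNS = [
        r"\bmultipl(y|ies|ied|ication)\b",
        r"\btimes table\b",
        r"\d+\s*[x×*]\s*\d+",                     # "3 x 4", "3*4", ...
    ]
    concept_re = re.compile("|".join(CONCEPT_PATTERNS), re.IGNORECASE)

    def strip_concept(docs):
        """Keep only documents with no surface mention of the concept."""
        return [d for d in docs if not concept_re.search(d)]

    docs = ["Add 3 and 4.", "3 x 4 = 12", "Multiplication is repeated addition."]
    print(strip_concept(docs))                    # -> ['Add 3 and 4.']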
Alon Halevy, Peter Norvig, and Fernando Pereira (2009): The Unreasonable Effectiveness of Data

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf
Yes, and it's what people seem to ignore when they talk about dethroning GPT-4 as the top LLM. It's good data expressly developed for training the behaviors they want that keeps them ahead; all the other stuff (other training and filtering web data) has much less of an impact.

See also "You won't train a better model from your desk": https://news.ycombinator.com/item?id=40155715
This insight makes one wonder whether the same thing applies to humans as well. Are we just the sum of our experiences? Or are the architectures of our brains so much more complex and different that they have more influence on the outputs for the same inputs?
That is what I have repeated so many times over the last two years. I consider Yi Tay's response [1] a mere technicality that is actually irrelevant. What is relevant is how predictable ("interpolatable") the data are, how predictable we are.

1. https://twitter.com/YiTayML/status/1783273130087289021
Is this a surprise?

Isn't this exactly what Naftali Tishby has been talking about [1]?

[1] https://www.youtube.com/watch?v=XL07WEc2TRI
The only thing this glosses over is RL. I guess you can see agents interacting in environments as a type of "dataset", but it _feels_ different.