
The "it" in AI models is the dataset

108 points by alvivar about 1 year ago

18 comments

mnk47 about 1 year ago

Yi Tay's response (chief scientist at Reka AI, ex-Google Brain researcher): https://twitter.com/YiTayML/status/1783273130087289021

> not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you feed it you'll still be lagging behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.

> on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level.

> the truth is the community stands on the shoulders of giants of all the arch research that has been done to push the transformer to this state today.

> architecture research matters. many people just take it for granted these days.
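Yi Tay's point that "your tokens cannot even see each other in a raw MLP" is easy to check numerically. A minimal sketch with random weights (NumPy only, not any particular model): perturbing one token's embedding leaves every other token's output unchanged under a position-wise MLP, but changes every token's output under self-attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # embedding dimension
T = 4   # sequence length
X = rng.normal(size=(T, d))   # token embeddings

# A raw position-wise MLP: the same weights applied to each token independently.
W1, W2 = rng.normal(size=(d, 16)), rng.normal(size=(16, d))
def mlp(X):
    return np.maximum(X @ W1, 0) @ W2   # ReLU MLP, no token mixing

# Single-head self-attention: every token attends to every other token.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V   # each output row is a mixture over all value rows

# Perturb token 3 and see whether token 0's output changes.
X2 = X.copy()
X2[3] += 1.0
print(np.allclose(mlp(X)[0], mlp(X2)[0]))              # token 0 never "sees" token 3
print(np.allclose(attention(X)[0], attention(X2)[0]))  # attention mixes tokens
```

No amount of extra data can teach the MLP a function that depends on token interactions it structurally cannot compute, which is the architectural point being made.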
tppiotrowski about 1 year ago

I took Andrew Ng's Coursera machine learning course in 2015, and to this day I still remember him saying this in one of the videos. At the time he was talking about various versions/optimizations of gradient descent, but he essentially said that tweaking the algorithm will only make your model ~1% better, while doubling the amount of training data will have a substantially larger impact (use any old algorithm, just throw more data at the problem). That's why it was already evident back then that Google, Facebook, etc. were sitting on a goldmine: in the long run, those with the most data, not the brightest PhDs, will win this race.
Eisenstein about 1 year ago

As a hobbyist who has trained models for use cases ranging from object detection and recognition to text completion to image generation, the best advice I've seen has consistently been to curate and annotate your dataset as perfectly as you can before worrying about anything else.

A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and the like can be fiddled with once you know you are on the right track, and in the scheme of things they are relatively minor for most purposes.

Of course, this advice gets routinely ignored: people spend countless hours fussing over how to set certain flags, grab as much data as possible, carelessly throw it all together, and train on it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.

It is a giant pain in the ass, but you have to spend the time sitting in front of the screen going through the data, removing things, tagging things, and making sure the details are right. This is really what makes the good models good and the rest mediocre.
disgruntledphd2 about 1 year ago

This makes me sad, not because I disagree with it, but because it's basically common wisdom among practitioners in the statistical and ML communities. In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.

That being said, you'll definitely see differences if you use a linear model (like lasso) vs a tree-based model (like XGBoost), but once you have a flexible enough model and a *lot* of data, training time and inference complexity tend to become better ways to make a model choice.
teekert about 1 year ago

I don't get this: "What that means is not only that they learn what it means to be a dog or a cat, ..."

We don't have any dataset of dog or cat experience, right? OP probably means that the model learns what a dog or cat is, right?

I find the whole piece somewhat vague, btw. No real insights, if you ask me. Sure, if all you put in is a dataset, that should be all you get out. What's surprising (worth HN) here?
rambambram about 1 year ago

> It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.

I read this in another comment. Sounds very much like a curation thing, and now I'm wondering: isn't this part already covered by the many human beings now interacting with ChatGPT and the like?

My uneducated guess is that a company can scrape the whole world wide web, low-quality content and all, and then strengthen/curate their data and/or model by having it interact with humans? You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct this by 'chatting' with it?
bilsbie about 1 year ago

Has anyone tried removing an entire concept from a dataset and seeing if the LLM can reason its way to the concept?

I think that would be a really cool experiment. There are probably some really good candidate concepts that take only a small leap of reasoning to reach. Off the top of my head: maybe multiplication? Or the concept of zero. Maybe the wheel?

Edit: if anyone is interested in doing this kind of stuff, hit me up (email in profile). I want to start doing these kinds of things as a side project.
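The first step of such an experiment would be the ablation filter itself. A minimal sketch, where `MULTIPLICATION_MARKERS` is a hypothetical keyword pattern standing in for "the concept of multiplication"; a real study would need far more careful matching (synonyms, symbols, and worked examples that imply the concept without naming it):

```python
import re

# Hypothetical markers for the target concept; illustrative only.
MULTIPLICATION_MARKERS = re.compile(
    r"\bmultipl\w*\b|\btimes table\b|\bproduct of\b|\d+\s*[x×*]\s*\d+",
    re.IGNORECASE,
)

def ablate_concept(corpus):
    """Drop every document that mentions the target concept."""
    return [doc for doc in corpus if not MULTIPLICATION_MARKERS.search(doc)]

corpus = [
    "Addition combines two numbers into a sum.",
    "To multiply 3 x 4, add 3 to itself 4 times.",
    "The product of two primes has exactly four divisors.",
    "Zero is the additive identity.",
]
print(ablate_concept(corpus))  # keeps only the addition and zero documents
```

The hard part of the experiment is not this filter but verifying that nothing in the remaining corpus leaks the concept, and then probing whether the trained model can reconstruct it from repeated addition alone.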
sampo about 1 year ago

Alon Halevy, Peter Norvig, and Fernando Pereira (2009): The Unreasonable Effectiveness of Data

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf
andy99 about 1 year ago

Yes, and it's what people seem to ignore when they talk about dethroning GPT-4 as the top LLM. It's good data, expressly developed for training the behaviors they want, that keeps them ahead; all the other stuff (other training and filtering web data) has much less of an impact.

See also "You won't train a better model from your desk": https://news.ycombinator.com/item?id=40155715
pyinstallwoes about 1 year ago

So "it" is the collective unconscious of humanity? The egregore of us all, our collective spirit? I see.
zer0gravity about 1 year ago

This insight makes one wonder if the same thing applies to humans as well. Are we just the sum of our experiences? Or are the architectures of our brains so much more complex and varied that they have more influence on the outputs for the same inputs?
tadala about 1 year ago

Ah, the nature vs nurture debate, we meet again!

Give me a neural net in its first epoch and I shall mold it into anything!
pk-protect-ai about 1 year ago

That is what I have repeated over and over for the last two years. I consider Yi Tay's response [1] a mere technicality that is actually irrelevant. What is relevant is how predictable ("interpolatable") the data are, and how predictable we are.

1. https://twitter.com/YiTayML/status/1783273130087289021
chrisdirl about 1 year ago

Is the secret sauce also tied to the generation distribution, which can differ from the dataset distribution, e.g. via RLHF?
troq13 about 1 year ago

Weak argument for something everyone already knew. Nice that you work at OpenAI, I guess.
tilt_error about 1 year ago

Is this a surprise? Isn't this exactly what Naftali Tishby has been talking about [1]?

[1] https://www.youtube.com/watch?v=XL07WEc2TRI
iNic about 1 year ago

The only thing this glosses over is RL. I guess you can see agents interacting in environments as a type of "dataset", but it _feels_ different.
redwood about 1 year ago

AKA "groupthink"