科技回声 (TechEcho) — a tech news platform built with Next.js, offering global tech news and discussion.

Ask HN: Why is Ilya saying data is limited when the whole world is data?

12 points | by georgestrakhov | 5 months ago
In his recent talk Ilya S. said that the data running out is a fundamental constraint on the scaling laws. He said "we have but one internet."

But I don't understand: there is so much data in the real world beyond the internet. Webcams. Microphones. Cars. Robots... Everything can collect multimodal data and, more importantly (for robots), even get feedback loops from reality.

So isn't data functionally infinite? And the only thing standing in the way is the number of sensors and open datastreams and datasets?

Please help me understand.

8 comments

m_ke · 5 months ago
There's a ton of recent work on data curation / synthetic data generation showing that smaller, high-quality datasets go a lot further than scaling up on noisy web data.

The scaling-law plots are log scale, so to get more juice with naive scaling we'd need to invest exponentially more resources, and we're at a point where the juice is not worth the squeeze. So people will shift to moving the curve down with new architectures, better-curated datasets, and test-time compute / RL.

See:

- FineWeb: https://arxiv.org/abs/2406.17557
- Phi-4: https://arxiv.org/abs/2412.08905
- DataComp: https://arxiv.org/abs/2406.11794
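The "log scale" point can be made concrete with a toy power-law data scaling curve of the Chinchilla-style form L(D) = E + B / D^β. The constants below are illustrative, not fitted values from any paper; the shape is what matters:

```python
# Toy data scaling law: loss = irreducible term + power-law term in dataset size.
# E, B, beta are made-up illustrative constants, not fitted from real runs.
E, B, beta = 1.69, 410.0, 0.28

def loss(tokens: float) -> float:
    """Predicted loss after training on `tokens` tokens of data."""
    return E + B / tokens**beta

# Each 10x increase in data shrinks the reducible loss by the same
# constant factor (10**-beta), so absolute gains keep getting smaller.
for d in [1e9, 1e10, 1e11, 1e12]:
    print(f"{d:.0e} tokens -> predicted loss {loss(d):.3f}")
```

On a log-x plot this looks like steady progress, but in linear terms each successive decade of data (and compute) buys a smaller absolute improvement, which is the "juice not worth the squeeze" regime the comment describes.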
unsupp0rted · 5 months ago
If you show an LLM one webcam feed to train on, that's useful. Two is even more useful. But there are diminishing returns: "useful training data" is limited.

Humans get to PhD level with barely a drop of training data compared to what LLMs are trained on.

If there were infinite useful data, then scaling AI on data would make sense. Since there isn't, the way forward is getting more efficient at using the data we have.
A_D_E_P_T · 5 months ago
Think about it from his perspective.

Data from the internet can be chunked, sorted, easily processed, and has a relatively high signal-to-noise ratio. Data from a webcam or a microphone -- if it's even legal to access in the first place -- would be a mess. Imagine chunking and processing 5 TB of that sort of data. Seems to me that the effort would far outweigh the reward.

Robots are a different problem entirely. It's darkly amusing that simple problems of motion through space are more complex to replicate than painting the simulacrum of a masterpiece, or acing the medical licensing exam. We'll probably have AGI before we can mimic the movement of the simple housefly.
sk11001 · 5 months ago
You’re thinking “any data”; he’s thinking “useful data for training an LLM”.
wef22 · 5 months ago
There are lots of problems where someone has to run experiments to generate data. If even the most optimized version of the experiment is expensive and slow to produce a single data point, then all you can do is wait until more data is produced before a solution is found. Think drug discovery.
EncryptedMan · 5 months ago
Much of the user-generated data stored by tech companies is proprietary, which limits access by external parties.
farseer · 5 months ago
What about all the books written since antiquity?
ganzuul · 5 months ago
Complex-systems studies is wisdom. We know how communication on the internet behaves. Conway's Law hits hard, and the processes of life are not dumb.

Access to physical reality is important when negotiating with the beings that can form under this constraint. People have apparently known this instinctively for a very long time, and they are not going to give in to the demands of the AI industry.

It's a great mistake to humanize everything in your consciousness.