
Mapping the semantic void: Strange goings-on in GPT embedding spaces

96 points by georgehill over 1 year ago

11 comments

kridsdale1 over 1 year ago
This is the first post on LessWrong that I found interesting and not insipid navel-gazing.

My interpretation of the finding is that LLM training has revealed the primordial nature of human language, akin to the primary function of genetic code predominantly being to ensure the ongoing mechanical homeostasis of the organism.

Language's evolutionary purpose is to drive group fitness. Group-management fundamentals are thus the predominant themes in the undefined latent space.

Who is in our tribe? Who should we trust? Who should we kill? Who leads us?

These matters likely dominated human speech for a thousand millennia.
Comment #38700094 not loaded
Comment #38704808 not loaded
Comment #38702407 not loaded
great_psy over 1 year ago
So I am a bit confused about the part where you go a distance k out from the centroid.

Since there are ~5000 dimensions, in which of those dimensions are we moving k out?

Is the idea that you just move out in all dimensions such that the final Euclidean distance is k? That seems to be how they get multiple samples at those distances.

Either way, I think it's more interesting to go out along specific dimensions. Ideally there is a mapping between each dimension and something inherent about the token, like the part where a dimension corresponds with the first word of the token.

We went through this discovery phase when we were generating images using autoencoders; same idea: some of those dimensions would correspond to certain features of the image, so moving along them would change the image output in some predictable way.

Either way, I think the overall structure of those spaces says something about how the human brain works (given that we invented the language). I'm interested to see whether anything neurological can be derived from those vector embeddings.
Comment #38698463 not loaded
Comment #38702916 not loaded
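A minimal sketch of what "moving out a Euclidean distance k from the centroid" could look like in code: pick an isotropic random direction (so no single dimension is privileged) and scale it to length k. This is purely illustrative and not the post's actual procedure; `embeddings`, the matrix shape, and `sample_at_distance` are made-up stand-ins, using NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a token-embedding matrix; 4096 columns is just an
# illustrative choice echoing the dimensionality discussed in the post.
embeddings = rng.normal(size=(50_000, 4096))
centroid = embeddings.mean(axis=0)

def sample_at_distance(centroid, k, n_samples, rng):
    """Draw points whose Euclidean distance from `centroid` is exactly k.

    A standard Gaussian vector normalised to unit length is uniform on the
    sphere, so the step is spread over all dimensions rather than taken
    along any particular axis.
    """
    dirs = rng.normal(size=(n_samples, centroid.shape[0]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return centroid + k * dirs

points = sample_at_distance(centroid, k=5.0, n_samples=10, rng=rng)
print(np.linalg.norm(points - centroid, axis=1))  # every entry ~5.0
```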
chpatrick over 1 year ago
I think that's really interesting, but one thing I'm not sure about is that a lot of tokens are just fragments of words like "ei", not whole words. Can we really draw conclusions about those embeddings?
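To see how common such fragments are, here is a quick illustrative check with an off-the-shelf BPE tokenizer (assumes the Hugging Face `transformers` package; the tokenizer choice and example words are arbitrary, not taken from the post):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
for word in ["cat", "unbelievability", "photosynthesis"]:
    # Common words map to a single token; rarer ones come back as several
    # byte-pair fragments, each with its own row in the embedding matrix.
    print(word, "->", tok.tokenize(word))
```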
PaulHoule over 1 year ago
Should try asking it to define things in French, Korean, etc.
empath-nirvana over 1 year ago
That is _profoundly_ interesting, and I think a serious study of it is going to reveal a lot about human (or at least English-speaking) psychology.
Comment #38699015 not loaded
nybsjytm over 1 year ago
Based on the first three figures, it seems like the author isn't familiar at all with probability in high-dimensional spaces, where these phenomena are exactly what you'd expect. Because of that, the figure with intersecting spheres is a big misinterpretation. I stopped reading there, but I think it's good to generally ignore things you find on LessWrong!
Comment #38697768 not loaded
Comment #38697747 not loaded
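For readers wondering which high-dimensional effects are meant, a small illustrative NumPy experiment (not from the post): i.i.d. Gaussian points in d = 4096 dimensions have norms that concentrate tightly around sqrt(d), and any two of them are nearly orthogonal, which is the kind of geometry the commenter argues should be expected rather than surprising.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                        # dimensionality comparable to the embeddings discussed
x = rng.normal(size=(1000, d))  # i.i.d. Gaussian points as a toy stand-in

norms = np.linalg.norm(x, axis=1)
print(norms.mean(), norms.std())        # clusters tightly around sqrt(d) ~= 64

# Pairwise cosine similarities sit close to zero: in high dimensions,
# "everything is roughly orthogonal to everything else".
unit = x / norms[:, None]
cos = unit @ unit.T
off_diag = cos[~np.eye(len(cos), dtype=bool)]
print(off_diag.mean(), off_diag.std())  # mean ~0, std ~1/sqrt(d)
```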
JPLeRouzic over 1 year ago
Please can someone explain the gist of this article to a mere human with no PhD in AI?
Comment #38696871 not loaded
Comment #38697175 not loaded
Comment #38697705 not loaded
Comment #38697626 not loaded
Comment #38696243 not loaded
flir over 1 year ago
Someone shoot me down if I'm wrong:

In this model (one of infinitely many possible models) a word is a point in 4096-space.

This article is trying to tease out the structure of those points, and suggesting we might be able to conclude something about natural language from that structure - like looking at the large-scale structure of the Universe and deriving information about the Big Bang.

Obvious questions: is the large-scale structure conserved across languages?

What happens if we train on random tokens - a corpus of noise? Does structure still emerge?

It might be interesting, it might be an artifact. I'd be curious to know what happens when you only examine complete words.
Comment #38703047 not loaded
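One hedged way to start poking at that structure empirically: pull a model's token-embedding matrix and look at how the tokens sit relative to their centroid. The sketch below uses GPT-2's 768-dimensional embeddings as a small stand-in for the 4096-dimensional GPT-J embeddings the post examines, and assumes `torch` plus Hugging Face `transformers`; it is not the post's methodology.

```python
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
emb = model.wte.weight.detach()          # (vocab_size, 768) token embeddings
centroid = emb.mean(dim=0)
dists = torch.linalg.norm(emb - centroid, dim=1)

# If the distances bunch into a thin shell, that is already a statement about
# the large-scale geometry before any interpretation is layered on top.
print(dists.mean().item(), dists.std().item(),
      dists.min().item(), dists.max().item())
```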
kgc over 1 year ago
Would be interesting to see how changing the tokenization strategy — whole words or even phrases instead of traditional tokens — changes the results.
jakedahn over 1 year ago
Has anyone done this analysis for other LLMs, like Llama 2?
harveywi over 1 year ago
Maybe a dumb question: shouldn't these be called immersions instead of embeddings?
Comment #38698244 not loaded
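For reference, the standard differential-geometry distinction the question gestures at (textbook definitions, not something the post itself discusses):

```latex
% A smooth map f : M \to N is an immersion if its differential
% df_p : T_p M \to T_{f(p)} N is injective at every point p.
% It is an embedding if it is an injective immersion that is, in addition,
% a homeomorphism onto its image f(M) with the subspace topology.
\[
  f \ \text{embedding} \;\Longrightarrow\; f \ \text{injective immersion},
  \qquad \text{but the converse fails in general.}
\]
```

In the ML usage, "embedding" is arguably borrowed loosely for any learned vector representation rather than used in either strict geometric sense.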