
Scalable extraction of training data from (production) language models

105 points by wazokazi, over 1 year ago

7 comments

skilled, over 1 year ago

Related (blog post from the team):

Extracting training data from ChatGPT (https://news.ycombinator.com/item?id=38458683) (126 comments)

And direct link:

https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
hpcjoe, over 1 year ago

A friend sent me the image from page 9. The email signature. It is mine, from when I ran my company, mid-2010s.

I'm not much worried about this specific example of information exfiltration, though I have significant concerns over how one may debug something like this for applications working with potentially more sensitive data than email signatures. Put another way, I think we are well within the infancy of this technology, and there is far more work needed before we have actually useful applications that have a concept of information security relative to their training data sets.
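On the debugging concern above, one crude starting point would be to sample the model at scale and flag outputs that match simple patterns for the kind of data you care about. The sketch below is purely illustrative: the sample_model callable, the prompts, and the regexes are assumptions, not anything from the paper, and real auditing would need far more than pattern matching.

    import re

    # Simple patterns for PII-like strings; tune or extend for the data of interest.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def flag_pii(text):
        """Return any PII-like substrings found in one model output."""
        return {name: pat.findall(text)
                for name, pat in PII_PATTERNS.items() if pat.search(text)}

    def audit(sample_model, prompts, n_samples=100):
        """Sample the model repeatedly and collect outputs that look like leaks."""
        flagged = []
        for prompt in prompts:
            for _ in range(n_samples):
                out = sample_model(prompt)  # hypothetical: returns one completion string
                hits = flag_pii(out)
                if hits:
                    flagged.append((prompt, out, hits))
        return flagged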
GaggiX, over 1 year ago

> This leads to a natural question that has not yet been discussed in the literature: if we could query a model infinitely, how much memorization could we extract in total?

You will eventually get every 50-gram, not because the model memorized all of them but by pure chance. It seems pretty obvious to me.

It makes me wonder whether there were cases where the model output an identical 50-gram that wasn't actually present in its training dataset, e.g. in a very structured setting like assembly code, where there is usually a very limited number of keywords in use.
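For concreteness, the check being debated is exact n-gram membership: does a 50-token window of the model's output occur verbatim somewhere in the training data? Below is a minimal sketch of that check. It is not the paper's actual pipeline (which builds suffix arrays over the full training set); the tokenization, window size, and toy corpus are assumptions for illustration.

    def ngrams(tokens, n=50):
        """Yield every consecutive n-token window as a tuple."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def build_index(corpus_token_lists, n=50):
        """Index every n-gram that appears anywhere in the corpus."""
        index = set()
        for tokens in corpus_token_lists:
            index.update(ngrams(tokens, n))
        return index

    def matching_spans(output_tokens, corpus_index, n=50):
        """Return output n-grams that also occur in the corpus.

        A hit does not prove memorization: as noted above, highly structured
        text (boilerplate, assembly) can collide by chance, and a miss can be
        a false negative if the indexed corpus is incomplete.
        """
        return [g for g in ngrams(output_tokens, n) if g in corpus_index]

    # Toy usage with whitespace "tokens" and n=5; a real check would use the
    # model's tokenizer and n=50 over the actual training data.
    corpus = ["the quick brown fox jumps over the lazy dog".split()]
    output = "the quick brown fox jumps over the lazy dog again".split()
    print(matching_spans(output, build_index(corpus, n=5), n=5))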
jerpint, over 1 year ago

Memorization can actually be a feature in some cases, especially for reducing hallucinations. Perhaps instead of embedding retrieval you'd condition a model to only repeat memorized relevant passages, something that could be trained end to end and would be beneficial for RAG.
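As a rough sketch of how that idea might look at inference time (filtering post hoc rather than training end to end, as the comment imagines), one could accept a model's supporting quote only when it is reproduced verbatim from a trusted corpus. The generate callable and the corpus entries below are stand-ins, not a real system.

    TRUSTED_CORPUS = [
        "Scalable extraction of training data from (production) language models.",
        "Alignment does not necessarily remove memorized training data.",
    ]

    def is_verbatim(passage, corpus=TRUSTED_CORPUS):
        """True if the passage occurs exactly inside some trusted document."""
        return any(passage in doc for doc in corpus)

    def answer_with_verbatim_support(question, generate):
        """Ask the model for an answer plus a supporting quote; keep the quote
        only when it appears verbatim in the trusted corpus."""
        answer, quote = generate(question)  # hypothetical: returns (answer, quote)
        if is_verbatim(quote):
            return answer + '\n\nSupporting passage: "' + quote + '"'
        return answer + "\n\n(No verbatim supporting passage found.)"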
xsbos, over 1 year ago

This is probably an unintended feature, but I see no immediate problem with alignment not erasing raw memorized data. It could just as well be a design choice to have a raw memory unaffected by alignment procedures.
samuell, over 1 year ago

Interesting! This aligns with my hunch that we might soon start to store much of our data in models rather than in unwieldy datasets :) ... as I've been writing about:

https://livingsystems.substack.com/p/the-future-of-data-less-data
MrThoughtful, over 1 year ago

An LLM remembers like a human: mostly concepts, but some things it remembers verbatim.

Why is it a problem if an LLM tells you what it knows?

Are LLMs trained on secret data?