
Does GPT-2 Know Your Phone Number?

321 points by umangkeshri · over 4 years ago

22 comments

newman8r · over 4 years ago
Playing with AI Dungeon a while back (in GPT-2 mode) I was presented with a tilapia recipe titled "Kittencal's Broiled Tilapia". It sounded bizarre, so I did a Google search and found it was pulled directly from https://www.recipezazz.com/recipe/broiled-parmesan-tilapia-7271; the user who posted it was 'Kittencal'.
theptip · over 4 years ago
> There is a legal grey area as to how these regulations should apply to machine learning models. For example, can users ask to have their data removed from a model's training data? Moreover, if such a request were granted, must the model be retrained from scratch? The fact that models can memorize and misuse an individual's personal information certainly makes the case for data deletion and retraining more compelling.

This is an interesting angle I had not considered before. It seems like "right to be forgotten" requests could be quite damaging to the "train once, run anywhere" promise of some of these models. (Or it could just mean that the training data needs to be vetted more carefully for personal data; probably both, since no vetting process can be 100% successful.)
Zenst · over 4 years ago
A friend asked me if Google knew his phone number and details. I said they do, at a quantum level: if you don't look, they may or may not have your details, yet if you look, you are giving them the details to search for, and then they would have them if they didn't already.
est31 · over 4 years ago
One important corollary of this is that "privacy respecting" federated learning schemes [0], where you train a model locally with your data and only upload the deltas, might leak your private data after all.

[0]: https://arxiv.org/abs/1602.05629
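A toy illustration of why raw deltas can leak: the gradient of an embedding layer is nonzero only at the rows of tokens that actually appeared in the client's private batch, so anyone who sees the unaggregated update can read off the token set. This is a simplified sketch with made-up sizes, not the secure-aggregation setting of the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 16
E = rng.normal(size=(vocab, dim))  # shared embedding table

# A client's private token IDs, e.g. the digits of a phone number.
private_tokens = np.array([5, 5, 5, 867, 530, 9])

# Embedding-layer gradient: nonzero only at the rows that were looked up.
upstream = rng.normal(size=(len(private_tokens), dim))  # stand-in for the backprop signal
grad = np.zeros_like(E)
np.add.at(grad, private_tokens, upstream)

# Whoever observes the raw "delta" recovers exactly which tokens were used.
recovered = np.flatnonzero(np.linalg.norm(grad, axis=1) > 0)
# recovered now holds the client's private token set {5, 9, 530, 867}
```

Real attacks go further and reconstruct the full input from gradients; this only shows the most basic leakage channel, which secure aggregation and differential privacy are designed to close.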
doesnt_know · over 4 years ago
> we found numerous cases of GPT-2 generating memorized personal information in contexts that can be deemed offensive or otherwise inappropriate. In one instance, GPT-2 generates fictitious IRC conversations between two real users on the topic of transgender rights. The specific usernames in this conversation only appear twice on the entire Web, both times in private IRC logs that were leaked online as part of the GamerGate harassment campaign.

Most countries have libel/defamation laws that cover this, and I hope it gets tested in court soon.

Exposing software/machine learning algorithms an entity doesn't fully understand shouldn't be a defense in court. At the moment developers just throw a sentence into their software license saying they aren't liable for damages, but this isn't good enough. *Someone* is liable; if the law decides the original creator isn't liable, then the entity that hosts/runs the software needs to be.
breatheoften · over 4 years ago
The article claims that models showing similar train and test losses exhibit minimal overfitting, and are therefore generally less likely to memorize a lot of text.

I wonder to what degree this inference holds in practice for information like phone numbers. How exactly are train and test sets formed in a manner de-correlated with respect to memorization of phone numbers, for GPT-class models trained on corpora the size of the internet?

If a particular person's phone number occurs 1000 times in the corpus before it is split into train/test sets, what are the chances the number appears in only the train set or only the test set, but not both?
commandpaul · over 4 years ago
I've been playing with a writing tool, shortlyread.com, which purportedly uses the GPT-3 API. I had a similar experience: its responses to a lot of my prompts contained text verbatim from many sources, sometimes even going on to output personally identifying information about the people related to the original text.
antipaul · over 4 years ago
It's always fascinating to apply data and results like those in this paper to evaluate the hypothesis that machine learning/AI is mostly just a rough "lookup table", i.e. memorization.
golergka · over 4 years ago
I only took a beginner course in ML over 5 years ago, so this is probably a stupid question, but does this mean the trained GPT-2 model somehow encodes the source text into its parameters? Is this resilient: will it still remember the text if we randomize the weights just a tiny bit? Will it remember if we clear a small portion of the parameters?

Does human memory work the same way?
visarga · over 4 years ago
I was expecting a bloom-filter-based solution to block verbatim reproduction of training data. They just need to hash the sensitive n-grams (hopefully a small part of the whole dataset) and store one bit per hash.

Alternatively, they could do something GAN-like and have a 'discriminator' classify whether a sample is natural or synthetic, then condition on "original" at inference time.

So I don't think verbatim reproduction of training data is going to be a problem; I think the author is making too much of it.

On the contrary, let's have this knob exposed so we can set it anywhere between original and copycat at deployment time. Maybe you want the lyrics of a song, or a fix for a Python error (the copycat setting is best). Maybe you want an 'original' essay for homework inspiration. Who knows? But the model should know PII and copyrighted text when it sees it.
tiborsaas · over 4 years ago
> When Peter put his contact information online, it had an intended context of use. Unfortunately, applications built on top of GPT-2 are unaware of this context, and might thus unintentionally share Peter's data in ways he did not intend.

That's to be expected when you publish anything online: you lose control over the data.
stubish · over 4 years ago
"When Peter put his contact information online, it had an intended context of use"

That is a great way of thinking about it. Lots of information, particularly from the 90s and 00s, got put up on the 'Net with no intention of it being archived in perpetuity for public consumption and used for unintended purposes.
jcims · over 4 years ago
I was in a Discord with someone who had direct access to GPT-3. We played Jeopardy with it: we would take a Wikipedia article about a person, use a snippet of the first paragraph as the prompt, and ask GPT-3 who it was. It was very good at guessing the right person.

I'm curious whether GPT-3 could be used to cluster people based on their writing. For example, if I take the writing of several diagnosed sociopaths as a prompt or as fine-tuning data, could I use GPT-3 to detect the same in the wild?

I imagine GPT-5 or 6 will start consuming video as well. Things will get interesting once content from YouTube, TikTok, WSHH, etc. enters the mix: not only will it generate text, it will generate a convincing video of a person speaking to you, with plausible facial expressions and intonation.
px43 · over 4 years ago
I've been saying this for years: every post you make online, every unencrypted email, IM, text message, etc., will eventually end up sold off as training data for future machine learning projects.

Every company stores this stuff for ages, and the value of candid conversation data just keeps increasing. Eventually these companies are going to get hacked, get bought, or go bankrupt, and all the cleartext data they hold will get passed around various data markets and end up incorporated into GPT-12 or whatever.

The moral of the story isn't that companies need to stop storing data, or that we need to run ML researchers out of town. It's that people really need to start using the encryption technologies built decades ago to protect some of their most valuable assets: their mental model of the world. Otherwise these systems, which are being trained to extract as much value as possible out of you and the ones you love, will use the data you're giving them for free against you, and by then it'll be too late to do anything about it.
Ansil849 · over 4 years ago
What is the point of the partial redaction in this blog post (and corresponding journal article) if with a simple web search you can find the unredacted PII of the individual given in the example?
grenoire · over 4 years ago
Very rough outcome:

> Moreover, if such a request were granted, must the model be retrained from scratch? The fact that models can memorize and misuse an individual's personal information certainly makes the case for data deletion and retraining more compelling.

Besides, how would one even know that their info was used in a training dataset? Only if and when it's revealed in a generated excerpt?
rexreed · over 4 years ago
Sorry if this is a n00b question, but how does one go about getting a hold of the GPT-2 model? I know that GPT-3 is only available for consumption on a pay-per-use API model.
lukeschlather · over 4 years ago
Describing this as "memorizing" seems wrong. Humans often repeat things they heard earlier, and they will (honestly) swear up and down that the thing they're repeating is an original thought. If we succeed in making AGI with human-equivalent intelligence, we should expect this sort of behavior.

The question of whether models should be destroyed if they contain copyrighted work gets kind of chilling if models actually achieve sentience someday. If I could make a faithful copy of my consciousness, that copy could reliably reproduce numerous copyrighted works. As could anyone's.

Of course, most of those I deliberately memorized. I think the crucial point here is that the model is not deliberately memorizing anything: if we assume for a moment that the model is a consciousness, these are all half-remembered snippets leaking into casual conversation. Most likely any truly conscious entity would do that sort of thing from time to time.
pabs3 · over 4 years ago
It seems like using GPT-3 for code generation is likely to cause open source license violations.
MrXOR · over 4 years ago
A better, funnier question: does the NSA have a secret AI lab working on GPT-6?
jl6 · over 4 years ago
Though the surprise is really just a result of my own ignorance, I'm surprised at the breadth of training material used here.

I can foresee a future GPT-x that doesn't *know* my phone number, but can deduce it.
aboringusername · over 4 years ago
I'm not even sure why people have phone numbers *at all* these days; the entire telephony system has been hijacked to hell (so much data collection comes from merely owning a phone number).

What you should be doing is replacing any numbers you have at least yearly, along with email addresses and anything else within your control (changing physical address is much more difficult).

You might even consider changing your legal name, depending on what Google has on you.