At first glance, this seems like one of the more interesting projects to come out of Facebook AI. Justification: In the future, AI models will increasingly become interwoven with tech. It's not going to be so much "AI programming" as just "programming".<p>That raises an interesting question – one that has bothered me for a long time: Who owns copyright on training data?<p>As we saw with Clearview AI, a lot of data is being used without consent or even knowledge of the creators. And it's extremely hard to detect this usage, let alone enforce rights on it.<p>I might be misunderstanding this work, but it <i>seems</i> like this would give you the ability to mark your digital data in such a way that you could prove it was later used in a model.<p>Unfortunately, it's not that simple. You don't have access to the models (normally). And I'm betting that this work is somehow domain-specific, meaning you can't really come up with a generalized marker to imprint on all your data.<p>But this implies you might be able to mark your data with <i>many</i> such markers, in hopes that one of them will later be triggered:<p><i>We also designed the radioactive data method so that it is extremely difficult to detect whether a data set is radioactive and to remove the marks from the trained model.</i><p>The flipside is interesting, too: This might give companies yet another way of tracking users. Now you can check whether a given user was in your model's actual training set, and if not, fine-tune the model on the fly.<p>Looking forward to seeing what comes of this.
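<p>Edit: to make the "mark your data, then prove it was used" idea concrete, here is a toy sketch of how such a scheme could work in principle. This is my own made-up example, not the paper's actual method: the feature dimensions, the logistic-regression stand-in for a real model, and the (exaggerated) mark strength are all arbitrary.<p><pre><code>
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 5000                          # feature dimension, sample count

# Secret "carrier" direction known only to the data owner.
u = rng.normal(size=d)
u /= np.linalg.norm(u)

# Toy features for a binary task; labels depend on the first coordinate.
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)

# Mark the positive-class samples: nudge their features toward u
# (exaggerated strength so the effect is obvious in a toy example).
X_marked = X.copy()
X_marked[y == 1] += 0.5 * u

def train_logreg(features, labels, lr=0.1, steps=500):
    """Plain logistic regression by gradient descent (a stand-in for a real model)."""
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(features @ w, -30, 30)))
        w -= lr * features.T @ (p - labels) / len(labels)
    return w

def alignment(w, carrier):
    """Cosine similarity between the trained weights and the secret carrier."""
    return w @ carrier / (np.linalg.norm(w) * np.linalg.norm(carrier))

w_clean = train_logreg(X, y)
w_marked = train_logreg(X_marked, y)

# Random d-dimensional directions have cosine around 0 (std roughly 1/sqrt(d)),
# so a clearly larger value is statistical evidence the marked data was used.
print("alignment, trained on clean data :", round(alignment(w_clean, u), 3))
print("alignment, trained on marked data:", round(alignment(w_marked, u), 3))
</code></pre>
The real method presumably applies the mark in the feature space of a deep network and backs the detection with a proper statistical test, but the core "secret direction, then check alignment of the trained model" idea is the part I find interesting.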
Not relevant to the main thrust of the article, but barium sulphate is not radioactive; it just absorbs X-rays efficiently. Radioactive markers are, I believe, most commonly used in PET scans; Wikipedia suggests fluorine-18 as the most commonly used isotope.
It is hard to believe that modifying input datasets won't modify the qualitative behavior of the outputs in some way.<p>This appears to be a modern variation of the <a href="https://en.wikipedia.org/wiki/Fictitious_entry" rel="nofollow">https://en.wikipedia.org/wiki/Fictitious_entry</a> / copy-trap technique that mapmakers have used in the past.
> Radioactive data could also help protect against the misuse of particular data sets in machine learning.<p>This last sentence is the real reason behind this technology. Training data isn't cheap and I'm sure the paying party needs a watermark on it.
The question would be whether it’s possible to make one’s behavioural data (online or offline) “radioactive” and then prove, with a high degree of accuracy, that someone (like Facebook) is stalking you online to deliver targeted ads.<p>At the moment advertising providers use a lot of data for ad targeting, some of which is benign and/or acquired with informed consent. As a result, it’s impossible for the user to tell whether an ad was targeted at them based on data they consented to share, or based on data they didn’t want collected or used for advertising purposes.
Large companies have no problem scraping data to be used to train their models, but they don't seem to feel the same way about you scraping theirs.
I'm surprised that it's even necessary to modify the dataset to achieve this. From what I've read, large models will often memorize their training data, and it seems like even with smaller models it should be possible to tell whether or not a model was trained on some set of images, simply because its loss on those images will be lower.
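<p>For example, a crude loss-based membership test might look like this (a hypothetical sketch; the membership_gap function and its calibration are made up by me, not taken from any paper):<p><pre><code>
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Per-sample negative log-likelihood of the true class."""
    return -np.log(np.clip(probs[np.arange(len(labels)), labels], eps, None))

def membership_gap(probs_candidate, labels_candidate, probs_holdout, labels_holdout):
    """Positive gap = the model fits the candidate images better than fresh data,
    which is (weak) evidence that the candidate set was part of training."""
    loss_candidate = cross_entropy(probs_candidate, labels_candidate).mean()
    loss_holdout = cross_entropy(probs_holdout, labels_holdout).mean()
    return loss_holdout - loss_candidate
</code></pre>
A real test would compare the candidate loss against the distribution of losses over many sets known to be unseen, rather than a single mean, which is roughly what published membership-inference attacks do.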
Not mentioned thus far anywhere in the article or in comments: potentially weaponizing this against deep fakes.<p>What's to stop cameras from making raw photos "radioactive" from now on, making deepfakes traceable by tainting the image-sets on which the models generating the deepfakes were trained?<p>This isn't my field. I'm certain there's a workaround, but I'd suspect detecting sufficiently well-placed markers would require knowing the original data pre-mark, which should be impossible if the data is marked before it's written to camera storage. I haven't even fully thought out the logistics yet, such as how to identify the radioactive data.<p>But am I missing something? I feel like this is viable.
Ctrl+F shows no mention of studying how post-processing such as quantization or pruning affects detection in models trained on the tampered dataset.<p>Overall, I instinctively think one could design an NN architecture that is not affected, or even detect the tampered pictures with a pre-processing pass and untamper them.<p>NNs are fuzzy by nature and tolerate noise; you could add a bit more noise to the dataset to defeat the "radioactiveness".<p>Also, I'm pretty sure Facebook is not doing this to protect user data, but I have no proof.
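<p>The kind of pre-processing pass I mean would be something like this (a toy sketch of my own; the noise level and quantization step are arbitrary, and there's no guarantee it actually removes the marks):<p><pre><code>
import numpy as np

def scrub(image, rng, noise_std=2.0, quant_step=8):
    """image: HxWxC uint8 array. Returns a lightly degraded copy."""
    x = image.astype(np.float32)
    x += rng.normal(scale=noise_std, size=x.shape)   # additive Gaussian noise
    x = np.round(x / quant_step) * quant_step        # coarse re-quantization
    return np.clip(x, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # stand-in for a training image
cleaned = scrub(img, rng)
</code></pre>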
Have not yet read the article, as Facebook is blocked at work, but I would guess this is mostly an application of steganographic techniques to hide known patterns in datasets that are likely to be stolen/borrowed for training.<p>Then observing the outputs of said models to try to discern related patterns.
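<p>To make that guess concrete, classic steganographic marking looks roughly like this (pure speculation on my part; naive LSB marks like these wouldn't survive training or even resizing, so whatever the article actually describes must be far more robust):<p><pre><code>
import numpy as np

def embed_pattern(image, pattern_bits):
    """Write a known bit pattern into the least significant bits of the first pixels."""
    flat = image.flatten().copy()
    flat[:len(pattern_bits)] = (flat[:len(pattern_bits)] & 0xFE) | pattern_bits
    return flat.reshape(image.shape)

def pattern_present(image, pattern_bits):
    """Check whether the image's LSBs match the known pattern."""
    return np.array_equal(image.flatten()[:len(pattern_bits)] & 1, pattern_bits)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
secret = rng.integers(0, 2, size=64, dtype=np.uint8)
tagged = embed_pattern(img, secret)
print(pattern_present(tagged, secret), pattern_present(img, secret))   # True False (almost surely)
</code></pre>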