I'm highly dubious of the ability of synthetic data to accurately model datasets without introducing unexpected bias, especially when it comes to causality.

If you dig through the original paper, the conclusion is in line with that:

"For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False."

So, on the tests they developed, the proposed method doesn't work 8 times out of 15…
I haven't read the original paper (yet), but something doesn't sit right with the work, if the way it is portrayed here is faithful to it and I'm not missing something important.

- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (Edit: so why not just ship that model and adapt it, instead of using it to generate data?)

- Its "validation" appears to be doubly proxied: the normal performance measures we use are themselves a proxy, and now we're comparing those against performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so far removed.

Can anyone explain this well?
On a parallel note, search for "thresholdout". It's another (genius, I think) way to stretch how far your data goes when training a model. Others have already explained it better than I could, so here's a nice link instead: http://andyljones.tumblr.com/post/127547085623/holdout-reuse
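Very roughly, the mechanism works like this (my own sketch, not the authors' code; the threshold and noise scale are made-up illustrative values):

    import numpy as np

    def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
        """Reusable-holdout answer for one query.

        train_vals / holdout_vals: per-example values of the statistic being
        queried (e.g. 0/1 correctness of a candidate feature) on the training
        and holdout sets. The idea is to only "spend" holdout information
        when training and holdout genuinely disagree.
        """
        rng = np.random.default_rng() if rng is None else rng
        train_mean = np.mean(train_vals)
        holdout_mean = np.mean(holdout_vals)
        # Only consult the holdout when the gap exceeds a noisy threshold;
        # otherwise answer from the training set alone.
        if abs(train_mean - holdout_mean) > threshold + rng.laplace(0, sigma):
            return holdout_mean + rng.laplace(0, sigma)
        return train_mean

The point being that the holdout can be queried many more times than a naive holdout before it stops generalizing, because most queries never touch it.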
If I were responsible for protecting the privacy of data, I don't know that I would be comfortable with this method. Anonymization of data is hard, and frequently turns out to be not as anonymous as originally thought. At a high level, this sounds like they are training an ML system on your data and then using it to generate similar data. What sort of guarantees can be given that the ML system won't simulate your data with too high a fidelity? I've seen too many image generators that output images very close to the data they were trained on. You could compare the two datasets and look for similarities, but you'd have to have good metrics of what sort of similarity was bad and what sort was good, and I could see that being tricky in both directions (a crude sketch of what I mean is below).

Although, I suppose that if the data was already anonymized to the best of your ability, and this was run on top of that as an additional layer of protection, that might be okay.
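The crudest check I can think of is comparing nearest-neighbor distances: how close does each synthetic row sit to some real row, versus how close real rows sit to each other? This is just a sketch for numeric columns (categoricals would need encoding first), and passing it proves nothing about privacy:

    import numpy as np
    from scipy.spatial import cKDTree

    def nearest_real_distances(real, synthetic):
        """For each synthetic row, distance to the closest real row.

        Lots of near-zero distances suggest the generator is copying records.
        """
        tree = cKDTree(real)
        dists, _ = tree.query(synthetic, k=1)
        return dists

    def real_to_real_distances(real):
        """Baseline: distance from each real row to its nearest *other* real row,
        so we know what "normal" closeness looks like in this dataset."""
        tree = cKDTree(real)
        dists, _ = tree.query(real, k=2)   # k=2: the first hit is the row itself
        return dists[:, 1]

If the synthetic-to-real distances are systematically much smaller than the real-to-real baseline, that's a red flag; but the converse tells you little, which is exactly the "tricky in both directions" problem.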
I wonder how secure it is against identifying individuals. With over-fitting, you can end up reproducing the training data as output. Hopefully they have a robust way to prevent that, or any other kind of reverse engineering of the output to recover the original data.
Could not get hold of the paper. Are they doing Gibbs sampling, or a semiparametric variant of it?

https://en.wikipedia.org/wiki/Gibbs_sampling

Generating tuples (rows) by Gibbs sampling would allow generation of samples from the joint distribution, which in turn would preserve all correlations, conditional probabilities, etc. This can be done by starting at an original tuple chosen at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting. The match might need to be relaxed from an exact match to a 'close' match.

If the conditional distribution for some conditioning event has very low entropy, or the conditional entropy is low overall, one would need to fuzz the original to preserve privacy, but this would come at the expense of distorting the correlations and conditionals.
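For concreteness, here's a rough sketch of the exact-match version of that scheme for purely categorical columns (no 'close' matching, no fuzzing, and certainly not the authors' actual method):

    import random
    import pandas as pd

    def gibbs_like_sample(df, n_steps=100, rng=None):
        """Generate one synthetic row by resampling columns from matching rows.

        df: a pandas DataFrame of categorical columns. Starting from a random
        original row, repeatedly overwrite one column by drawing from the rows
        that match the current row on all *other* columns, i.e. an empirical
        approximation of that column's conditional distribution.
        """
        rng = rng or random.Random()
        cols = list(df.columns)
        current = df.iloc[rng.randrange(len(df))].copy()
        for _ in range(n_steps):
            col = rng.choice(cols)
            others = [c for c in cols if c != col]
            # Rows agreeing with the current state on every other column.
            mask = (df[others] == current[others]).all(axis=1)
            matches = df.loc[mask, col]
            if len(matches) > 0:
                current[col] = matches.iloc[rng.randrange(len(matches))]
        return current

With many columns the exact-match conditioning set quickly becomes empty or a single row, which is exactly where the relaxation to 'close' matches (or fuzzing, at the cost of distorted conditionals) would have to come in.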
I am looking into their experiments. Most of them seem to be pretty simple prediction/classification tasks, so it's no wonder they get good results. The claim is too bold, and I would reject this paper.
They should clarify that the data is good enough for linear regression, rather than claiming there is no difference between real and synthetic data.
The abstract claims there was no difference only 70% of the time, so 30% of the time there was a difference. Unsurprisingly, the study greatly limits the kinds of data analysis that were allowed, which greatly reduces the applicability even if you believe the result. I'm pretty dubious of this work anyway.
Heh. I wrote a paper about this a while ago: https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069
Does someone have a link to the preprint / arxiv? The link in the story is a 404 (I presume that the paper just hasn't been posted yet or something?)