I'm highly dubious of the ability of synthetic data to accurately model datasets without introducing unexpected bias, especially when it comes to causality.

If you dig through the original paper, the conclusion is in line with that:

"For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False."

So, on the tests they developed, the proposed method doesn't work 8 times out of 15…
I haven't read the original paper (yet), but something doesn't sit right with the work, if the way it is portrayed here is faithful to it and I'm not missing something important.

- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (Edit: so why not just ship that model and adapt it, instead of using it to generate data?)

- Its "validation" appears to be doubly proxied: the normal performance measures we use are themselves a proxy, and now we're comparing those against performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so far removed.

Can anyone explain this well?
On a parallel note, search for "thresholdout". It's another (genius, I think) way to stretch how far your data goes when training a model. Others have already explained it better than I could, so here's a nice link instead: http://andyljones.tumblr.com/post/127547085623/holdout-reuse
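Very roughly, the mechanism works like this (my own sketch, not the authors' code; the threshold and noise scale are made-up illustrative values):

    import numpy as np

    def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
        """Reusable-holdout answer for one query.

        train_vals / holdout_vals: per-example values of the statistic being
        queried (e.g. 0/1 correctness of a candidate feature) on the training
        and holdout sets. The idea is to only "spend" holdout information
        when training and holdout genuinely disagree.
        """
        rng = np.random.default_rng() if rng is None else rng
        train_mean = np.mean(train_vals)
        holdout_mean = np.mean(holdout_vals)
        # Only consult the holdout when the gap exceeds a noisy threshold;
        # otherwise answer from the training set alone.
        if abs(train_mean - holdout_mean) > threshold + rng.laplace(0, sigma):
            return holdout_mean + rng.laplace(0, sigma)
        return train_mean

The point being that the holdout can be queried many more times than a naive holdout before it stops generalizing, because most queries never touch it.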
If I were responsible for protecting the privacy of data, I don't know that I would be comfortable with this method. Anonymization of data is hard, and frequently turns out to be not as anonymous as originally thought. At a high level, this sounds like they are training an ML system on your data and then using it to generate similar data. What sort of guarantees can be given that the ML system won't simulate your data with too high a fidelity? I've seen too many image generators that output images very close to the data they were trained on. You could compare the two datasets and look for similarities, but you'd have to have good metrics of what sort of similarity was bad and what sort was good, and I could see that being tricky in both directions (a crude sketch of what I mean is below).

Although, I suppose that if the data was already anonymized to the best of your ability, and this was run on top of that as an additional layer of protection, that might be okay.
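The crudest check I can think of is comparing nearest-neighbor distances: how close does each synthetic row sit to some real row, versus how close real rows sit to each other? This is just a sketch for numeric columns (categoricals would need encoding first), and passing it proves nothing about privacy:

    import numpy as np
    from scipy.spatial import cKDTree

    def nearest_real_distances(real, synthetic):
        """For each synthetic row, distance to the closest real row.

        Lots of near-zero distances suggest the generator is copying records.
        """
        tree = cKDTree(real)
        dists, _ = tree.query(synthetic, k=1)
        return dists

    def real_to_real_distances(real):
        """Baseline: distance from each real row to its nearest *other* real row,
        so we know what "normal" closeness looks like in this dataset."""
        tree = cKDTree(real)
        dists, _ = tree.query(real, k=2)   # k=2: the first hit is the row itself
        return dists[:, 1]

If the synthetic-to-real distances are systematically much smaller than the real-to-real baseline, that's a red flag; but the converse tells you little, which is exactly the "tricky in both directions" problem.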
I wonder how secure it is against identifying individuals. With over-fitting, you can end up reproducing the training data as output. Hopefully they have a robust way to prevent that, or any other kind of reverse engineering of the output to recover the original data.
Could not get hold of the paper. Are they doing Gibbs sampling, or a semiparametric variant of it?

https://en.wikipedia.org/wiki/Gibbs_sampling

Generating tuples (rows) by Gibbs sampling would allow generation of samples from the joint distribution, which in turn would preserve all correlations, conditional probabilities, etc. This can be done by starting at an original tuple chosen at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting. The match might need to be relaxed from an exact match to a 'close' match.

If the conditional distribution for some conditioning event has very low entropy, or the conditional entropy is low overall, one would need to fuzz the original to preserve privacy, but this would come at the expense of distorting the correlations and conditionals.
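For concreteness, here's a rough sketch of the exact-match version of that scheme for purely categorical columns (no 'close' matching, no fuzzing, and certainly not the authors' actual method):

    import random
    import pandas as pd

    def gibbs_like_sample(df, n_steps=100, rng=None):
        """Generate one synthetic row by resampling columns from matching rows.

        df: a pandas DataFrame of categorical columns. Starting from a random
        original row, repeatedly overwrite one column by drawing from the rows
        that match the current row on all *other* columns, i.e. an empirical
        approximation of that column's conditional distribution.
        """
        rng = rng or random.Random()
        cols = list(df.columns)
        current = df.iloc[rng.randrange(len(df))].copy()
        for _ in range(n_steps):
            col = rng.choice(cols)
            others = [c for c in cols if c != col]
            # Rows agreeing with the current state on every other column.
            mask = (df[others] == current[others]).all(axis=1)
            matches = df.loc[mask, col]
            if len(matches) > 0:
                current[col] = matches.iloc[rng.randrange(len(matches))]
        return current

With many columns the exact-match conditioning set quickly becomes empty or a single row, which is exactly where the relaxation to 'close' matches (or fuzzing, at the cost of distorted conditionals) would have to come in.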
I am looking into their experiments. Most of them seem to be pretty simple prediction/classification tasks, so it's no wonder they get good results. The claim is too bold, and I would reject this paper.
They should clarify that the data is good enough for linear regression, rather than claiming there is no difference between real and synthetic data.
The abstract claims there was no difference only 70% of the time, so 30% of the time there was a difference. Unsurprisingly, the study greatly limits the kinds of data analysis that were allowed, which greatly reduces the applicability even if you believe the result. I'm pretty dubious of this work anyway.
Heh. I wrote a paper about this a while ago: https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069
Does someone have a link to the preprint / arxiv? The link in the story is a 404 (I presume that the paper just hasn't been posted yet or something?)