I work for a data privacy startup, and this article unfortunately groups all forms of anonymization together. It is specifically criticizing forms of anonymization that only treat direct identifiers like names, addresses, and phone numbers. That is usually referred to as "pseudonymization", and they are correct to point out that an only moderately sophisticated attacker can still link people in the dataset using combinations of indirect identifiers like age, birthday, and zipcode. Pseudonymization is a weak form of privacy.<p>More sophisticated methods of privacy also anonymize indirect identifiers, and in some cases personal attributes. They do this by adding noise to the data in such a way that the noise has relatively* minimal impact on the results of computations made over the dataset, but a significant impact on the ability to re-identify someone using indirect identifiers or attributes.<p>*There is always a tradeoff between privacy and utility. The only way to achieve 100% private data is 100% noise, but the privacy-utility tradeoff curve isn't linear, and you can still achieve very good utility and very good privacy in many cases, especially with the best tools. Methods are also improving over time, reducing the impact of the tradeoff.
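To make the "adding noise" part concrete, here is a minimal sketch of the Laplace mechanism from differential privacy. The toy dataset and epsilon values are my own illustrative choices, not anything from the article:

```python
import numpy as np

# Toy dataset: ages of eight people (purely illustrative).
ages = np.array([34, 29, 41, 52, 38, 45, 31, 60])

def dp_count_over_40(data, epsilon=0.5):
    # A counting query has sensitivity 1 (adding or removing one person
    # changes the count by at most 1), so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy.
    true_count = int((data > 40).sum())
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon = more noise = more privacy, less utility.
print(dp_count_over_40(ages, epsilon=0.1))  # very noisy
print(dp_count_over_40(ages, epsilon=2.0))  # usually close to the true count, 4
```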
Great article putting all the relevant content in one place. Does anyone know of any de-anonymization services? The startup I am working at is privacy focused, and we are looking for a way to demonstrate why you need an additional layer to protect and compartmentalize.<p>Short of us buying up data in bulk and doing the de-anonymization in-house, I am not seeing an easy way to do this. I can't even find an advertised partner; it seems like all the articles are careful not to do free marketing for companies in this space.
"De-identified data isn't. Either it's not de-identified, or it's not data anymore."<p>-- Cynthia Dwork, one of the invertors of Differential Privacy, Knuth Prize and Gödel Prize winner.<p><a href="https://youtu.be/RWpG0ag6j9c?feature=shared&t=274" rel="nofollow noreferrer">https://youtu.be/RWpG0ag6j9c?feature=shared&t=274</a>
GDPR specifically mentions pseudonymous data in Recital 26:<p><i>"The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly."</i><p>Under EU law, there is no special class of "personally identifying information". <i>Any</i> data that relates to a person or <i>could be</i> related to a person is protected. It isn't enough to just strip the name and SSN fields out of the database and call it anonymous; you need to demonstrate that the data couldn't be attributed to anyone through any reasonably practical process.<p><a href="https://gdpr-info.eu/recitals/no-26/" rel="nofollow noreferrer">https://gdpr-info.eu/recitals/no-26/</a>
I like to make an appearance in threads in this domain and just say: yeah, it's not just possible but "fun" to put the puzzle pieces together.<p>It is a very tough problem to solve, especially when you consider the richness of the datasets used to put the pieces together.<p>In my experience the only effective countermeasure, in addition to the common-sense steps mentioned here, is to poison the data.<p>Poisoning a dataset means seeding the datasets used to discover PII with fictional look-alikes that resist debunking, as sketched below.<p>Additionally, you can poison the core set if you are very clever about it.
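A toy sketch of the seeding idea, purely for illustration; the field names and generation rules here are assumptions, not a description of any real poisoning pipeline:

```python
import random
import string

# Hypothetical name pools for generating decoys.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Morgan"]
LAST_NAMES = ["Reyes", "Kim", "Okafor", "Lindqvist", "Barros"]

def fictional_record(real_zipcodes):
    # Reuse real zipcodes and plausible ages so the decoy blends into
    # the genuine records and resists simple range/frequency debunking.
    return {
        "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
        "age": random.randint(22, 68),
        "zipcode": random.choice(real_zipcodes),
        "phone": "555-" + "".join(random.choices(string.digits, k=4)),
    }

def poison(dataset, ratio=0.1):
    # Seed the dataset with fictional look-alikes at the given ratio.
    zipcodes = [row["zipcode"] for row in dataset]
    decoys = [fictional_record(zipcodes)
              for _ in range(max(1, int(len(dataset) * ratio)))]
    return dataset + decoys
```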
Anonymization of personal data is a very tricky thing. It is so tricky that the GDPR doesn't even really address it, and I don't know of any official SOP for doing it.<p>Of course k-anonymity and differential privacy can help you with a closed dataset, but once you add records to a dataset over time, everything breaks down.<p>I once tried to find a way to anonymize data on different clients and then connect the records of the same entity on the server. The main problem with this method was that one could artificially generate a new, non-identifying data point for an individual and then observe which record in the central database it ends up in.
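For the closed-dataset case, here is a minimal sketch of a k-anonymity check (the rows and quasi-identifier columns are assumptions for the example). Note how appending a single record silently drops k back to 1, which is exactly where things break down over time:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    # k is the size of the smallest group of rows sharing the same
    # combination of quasi-identifier values; k == 1 means at least
    # one person is unique on those columns alone.
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

rows = [
    {"age": 34, "zip": "10001", "diagnosis": "flu"},
    {"age": 34, "zip": "10001", "diagnosis": "asthma"},
    {"age": 52, "zip": "94105", "diagnosis": "flu"},
    {"age": 52, "zip": "94105", "diagnosis": "covid"},
]
print(k_anonymity(rows, ["age", "zip"]))  # 2

# One new record added later and the guarantee quietly degrades:
rows.append({"age": 61, "zip": "60601", "diagnosis": "flu"})
print(k_anonymity(rows, ["age", "zip"]))  # 1
```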
More about why location information IS personal information:
<a href="https://consciousdigital.org/location-data-is-personally-identifiable-information/" rel="nofollow noreferrer">https://consciousdigital.org/location-data-is-personally-ide...</a>
I don't think humanity is going to get around disclosing exactly how all that data flows into marketing, UX design, and policies. Not in the form of pop-science books or blockbuster documentaries, but as detailed statistics and open reports from the companies themselves.
There's anonymizing data so that creepy guy two floors down can't stalk the barista across the street, and there's anonymizing data so that you can publish it to the world.<p>Barely even comparable.
As others pointed out, the article mixes a lot of things together. The EU (GDPR) sets a very specific and very hard-to-meet anonymization bar (tl;dr: it requires anonymization at a level where it is mathematically improbable to de-anonymize the user). None of the “anonymization” examples in the article would pass this EU bar.
Every time I hear "anonymous data", I think of the time AOL published anonymized search logs (for academic research). The anonymization was negligent, and an NYT reporter de-anonymized one of the users and tracked her down using the local and personal info present in her search queries.<p><a href="https://en.wikipedia.org/wiki/AOL_search_log_release" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/AOL_search_log_release</a><p><a href="https://web.archive.org/web/20130404175032/http://www.nytimes.com/2006/08/09/technology/09aol.html?_r=1" rel="nofollow noreferrer">https://web.archive.org/web/20130404175032/http://www.nytime...</a>
Reminds me of the telco offer to sell GPS location traces of anyone you wanted, except that to ensure 'anonymity' you had to order a batch of at least 30 people, and you did not get information on which trace belonged to whom in that set.<p>Figuring out which of those traces belonged to which employee was a real puzzle /s
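The /s is warranted: a trace's most frequent overnight location is usually a home address, so with a known list of employees the matching is nearly mechanical. A hypothetical sketch, where the data layout, coordinate rounding, and matching rule are all my own assumptions:

```python
from collections import Counter

def likely_home(trace):
    # Guess home: the most frequent (coarsened) coordinate seen
    # between midnight and 6am. trace is a list of (hour, lat, lon).
    night = [(round(lat, 3), round(lon, 3))
             for hour, lat, lon in trace if hour < 6]
    return Counter(night).most_common(1)[0][0]

def link_traces(traces, home_addresses):
    # Match each "anonymous" trace to the employee whose known home
    # address (lat, lon) is closest to the trace's inferred home.
    matches = {}
    for trace_id, trace in traces.items():
        lat, lon = likely_home(trace)
        matches[trace_id] = min(
            home_addresses,
            key=lambda name: (home_addresses[name][0] - lat) ** 2
                           + (home_addresses[name][1] - lon) ** 2,
        )
    return matches
```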