I work for a data privacy startup, and this article unfortunately groups all forms of anonymization together. It is specifically criticizing forms of anonymization that only treat direct identifiers like names, addresses, and phone numbers. That is usually referred to as "pseudonymization", and they are correct to point out that an only moderately sophisticated attacker can still link people in the dataset using combinations of indirect identifiers like age, birthday, and zipcode. Pseudonymization is a weak form of privacy.<p>More sophisticated methods of privacy also anonymize indirect identifiers, and in some cases personal attributes. They do this by adding noise to the data in such a way that the noise has relatively* minimal impact on the results of computations made over the dataset, but a significant impact on the ability to re-identify someone using indirect identifiers or attributes.<p>*There is always a tradeoff between privacy and utility. The only way to achieve 100% private data is 100% noise, but the privacy-utility tradeoff curve isn't linear, and you can still achieve very good utility and very good privacy in many cases, especially with the best tools. Methods are also improving over time, reducing the impact of the tradeoff.
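To make the "adding noise" part concrete, here is a minimal sketch of the Laplace mechanism from differential privacy. The toy dataset and epsilon values are my own illustrative choices, not anything from the article:

```python
import numpy as np

# Toy dataset: ages of eight people (purely illustrative).
ages = np.array([34, 29, 41, 52, 38, 45, 31, 60])

def dp_count_over_40(data, epsilon=0.5):
    # A counting query has sensitivity 1 (adding or removing one person
    # changes the count by at most 1), so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy.
    true_count = int((data > 40).sum())
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon = more noise = more privacy, less utility.
print(dp_count_over_40(ages, epsilon=0.1))  # very noisy
print(dp_count_over_40(ages, epsilon=2.0))  # usually close to the true count, 4
```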
Great article putting all the relevant content in one place. Does anyone know of any de-anonymization services? The startup I am working at is privacy focused, and we are looking for a way to demonstrate why you need an additional layer to protect and compartmentalize.<p>Short of us buying up data in bulk and doing the de-anonymization in-house, I am not seeing an easy way to do this. I can't even find an advertised partner; it seems like all the articles are careful not to do free marketing for companies in this space.
"De-identified data isn't. Either it's not de-identified, or it's not data anymore."<p>-- Cynthia Dwork, one of the invertors of Differential Privacy, Knuth Prize and Gödel Prize winner.<p><a href="https://youtu.be/RWpG0ag6j9c?feature=shared&t=274" rel="nofollow noreferrer">https://youtu.be/RWpG0ag6j9c?feature=shared&t=274</a>
GDPR specifically mentions pseudonymous data in Recital 26:<p><i>"The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person. To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly."</i><p>Under EU law, there is no special class of "personally identifying information". <i>Any</i> data that relates to a person or <i>could be</i> related to a person is protected. It isn't enough to just strip the name and SSN fields out of the database and call it anonymous; you need to demonstrate that the data couldn't be attributed to anyone through any reasonably practical process.<p><a href="https://gdpr-info.eu/recitals/no-26/" rel="nofollow noreferrer">https://gdpr-info.eu/recitals/no-26/</a>
I like to make an appearance in threads in this domain and just say: yeah, it's not just possible but "fun" to put the puzzle pieces together.<p>It is a very tough problem to solve, especially when you consider the richness of the datasets used to put the pieces together.<p>In my experience the only effective countermeasure, in addition to the common-sense steps mentioned here, is to poison the data.<p>Poisoning a dataset means seeding the datasets used to discover PII with fictional look-alikes that resist debunking, as sketched below.<p>Additionally, you can poison the core set if you are very clever about it.
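A toy sketch of the seeding idea, purely for illustration; the field names and generation rules here are assumptions, not a description of any real poisoning pipeline:

```python
import random
import string

# Hypothetical name pools for generating decoys.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Morgan"]
LAST_NAMES = ["Reyes", "Kim", "Okafor", "Lindqvist", "Barros"]

def fictional_record(real_zipcodes):
    # Reuse real zipcodes and plausible ages so the decoy blends into
    # the genuine records and resists simple range/frequency debunking.
    return {
        "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}",
        "age": random.randint(22, 68),
        "zipcode": random.choice(real_zipcodes),
        "phone": "555-" + "".join(random.choices(string.digits, k=4)),
    }

def poison(dataset, ratio=0.1):
    # Seed the dataset with fictional look-alikes at the given ratio.
    zipcodes = [row["zipcode"] for row in dataset]
    decoys = [fictional_record(zipcodes)
              for _ in range(max(1, int(len(dataset) * ratio)))]
    return dataset + decoys
```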
Anonymization of personal data is a very tricky thing. It is so tricky that the GDPR doesn't even really address it, and I don't know of any official SOP for doing it.<p>Of course k-anonymity and differential privacy can help you with a closed dataset, but once you add records to a dataset over time, everything breaks down.<p>I once tried to find a way to anonymize data on different clients and then connect the records of the same entity on the server. The main problem with this method was that one could artificially generate a new, non-identifying data point for an individual and then observe which record in the central database it ends up in.
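For the closed-dataset case, here is a minimal sketch of a k-anonymity check (the rows and quasi-identifier columns are assumptions for the example). Note how appending a single record silently drops k back to 1, which is exactly where things break down over time:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    # k is the size of the smallest group of rows sharing the same
    # combination of quasi-identifier values; k == 1 means at least
    # one person is unique on those columns alone.
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

rows = [
    {"age": 34, "zip": "10001", "diagnosis": "flu"},
    {"age": 34, "zip": "10001", "diagnosis": "asthma"},
    {"age": 52, "zip": "94105", "diagnosis": "flu"},
    {"age": 52, "zip": "94105", "diagnosis": "covid"},
]
print(k_anonymity(rows, ["age", "zip"]))  # 2

# One new record added later and the guarantee quietly degrades:
rows.append({"age": 61, "zip": "60601", "diagnosis": "flu"})
print(k_anonymity(rows, ["age", "zip"]))  # 1
```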
More about why location information IS personal information:
<a href="https://consciousdigital.org/location-data-is-personally-identifiable-information/" rel="nofollow noreferrer">https://consciousdigital.org/location-data-is-personally-ide...</a>
I don't think humanity is going to get around disclosing exactly how all that data flows into marketing, UX design, and policies. Not in the form of pop-science books or blockbuster documentaries, but as detailed statistics and open reports from the companies themselves.
There's anonymizing data so that creepy guy two floors down can't stalk the barista across the street, and there's anonymizing data so that you can publish it to the world.<p>Barely even comparable.
As others pointed out, the article mixes a lot of things together. The EU (GDPR) sets a very specific and very hard-to-meet anonymization bar (tl;dr: it requires anonymization at a level where it is mathematically improbable to de-anonymize the user). None of the “anonymization” examples in the article would pass this EU bar.
Every time I hear "anonymous data", I think of the time AOL published anonymized search logs (for academic research). The anonymization was negligent, and an NYT reporter de-anonymized one of the users and tracked her down using the local and personal info present in her search queries.<p><a href="https://en.wikipedia.org/wiki/AOL_search_log_release" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/AOL_search_log_release</a><p><a href="https://web.archive.org/web/20130404175032/http://www.nytimes.com/2006/08/09/technology/09aol.html?_r=1" rel="nofollow noreferrer">https://web.archive.org/web/20130404175032/http://www.nytime...</a>
Reminds me of the telco offer to sell GPS location traces of anyone you wanted, except that to ensure 'anonymity' you had to order a batch of at least 30 people, and you did not get information on which trace belonged to whom in that set.<p>Figuring out which of those traces belonged to which employee was a real puzzle /s
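The /s is warranted: a trace's most frequent overnight location is usually a home address, so with a known list of employees the matching is nearly mechanical. A hypothetical sketch, where the data layout, coordinate rounding, and matching rule are all my own assumptions:

```python
from collections import Counter

def likely_home(trace):
    # Guess home: the most frequent (coarsened) coordinate seen
    # between midnight and 6am. trace is a list of (hour, lat, lon).
    night = [(round(lat, 3), round(lon, 3))
             for hour, lat, lon in trace if hour < 6]
    return Counter(night).most_common(1)[0][0]

def link_traces(traces, home_addresses):
    # Match each "anonymous" trace to the employee whose known home
    # address (lat, lon) is closest to the trace's inferred home.
    matches = {}
    for trace_id, trace in traces.items():
        lat, lon = likely_home(trace)
        matches[trace_id] = min(
            home_addresses,
            key=lambda name: (home_addresses[name][0] - lat) ** 2
                           + (home_addresses[name][1] - lon) ** 2,
        )
    return matches
```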