The False Allure of Hashing for Anonymization

200 points by twakefield about 7 years ago

26 comments

aidos about 7 years ago

I saw a case a few years ago where the management of a company I knew were worried that the sales team were covering their mistakes and lying about it to blame the (external) dev team's code. They asked me to take a look into it one morning.

At first glance there didn't seem to be a lot to go on. There was no auditing in the application itself so I focused on the nginx logs. It's amazing how clear of a picture you can create from ip addresses, user agent strings and accessed urls.

Within an hour I could say with a high degree of certainty that the story was something like:

<pre><code>
Sales rep makes mistake with record on Friday afternoon
Monday morning - at home, late for work
Receives call from another rep re mistake
Logs in via mobile device to see the issue
Logs in via desktop to fix broken record
Arrives at work 1.5 hours later
Claims dev team had broken the record for the weekend
</code></pre>

There's a lot of information lurking in log files (let alone insecure dbs), and that's just the tip of the iceberg of what's stored these days. I dread to think how much personal information is stored in some of the bigger CRM apps these days.

Quite frankly I'm glad there's a push to start thinking about this stuff from the outset at the moment.
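
For readers who want to try this kind of reconstruction, here is a minimal sketch. It assumes the logs use nginx's "combined" format; the regular expression is a simplification, and the sample lines (addresses, times, paths) are invented for illustration rather than taken from the case described above.

```python
import re
from datetime import datetime

# Simplified pattern for nginx's "combined" log format (an assumption).
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def timeline(lines):
    """Return (timestamp, ip, user agent, url) tuples sorted by time."""
    events = []
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed format
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        events.append((when, m["ip"], m["agent"], m["url"]))
    return sorted(events)

# Invented sample lines, not real log data.
sample = [
    '198.51.100.4 - - [07/May/2018:08:10:00 +0000] "POST /records/42 HTTP/1.1" 200 87 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '198.51.100.4 - - [07/May/2018:08:02:11 +0000] "GET /records/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (iPhone)"',
]
for when, ip, agent, url in timeline(sample):
    print(when.isoformat(), ip, agent, url)
```

Sorting by time and reading the IP, user agent and URL side by side is often enough to see one person switching from a phone to a desktop.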

michaelbuckbee about 7 years ago

In digital security there is the concept of "defense in depth", that no one product, feature, approach or safeguard is going to magically make you protected from attacks. What's required are multiple overlapping layers of protection that collectively work together to create a more protected whole.

We're seeing more of this with privacy and user data. The author very correctly points out some issues with hashing and "pure" anonymization. It's more correctly considered "pseudonymization" (which is a recommended GDPR technique [1]).

All of which is to say _it's still an improvement over nothing_ and when layered with other techniques can help protect user privacy.

[1] https://blog.varonis.com/gdpr-requirements-list-in-plain-english/#article25

kevin_nisbet about 7 years ago

Author here.

Using crypto hashes to anonymize data is one of those mistakes I've seen several times, and I wanted to draw some attention to the issue so that hopefully we can all learn from it.

Let me know if you have any questions.

russnewcomer about 7 years ago

This is a question that I've thought of recently, as I am going to be working with a set of data that is the kind of data that may have damaging personal repercussions if identified with you but is good for society as a whole to be tracking, but that tracking doesn't have to be personally identifiable. Something like, it could be bad for me if it was revealed to my insurance company that I drove more than 5000 miles a year on a motorcycle, but beneficial for society as a whole to understand accident rates for high mileage motorcycle drivers. Do you have any thoughts/resources on how one could go about creating a privacy environment where users could input how many miles they drove, and where we have reporting that analyzes that information they put in? My first thought had been hashing primary keys, but as you point out in your article, that obviously isn't the best answer.

PeterisP about 7 years ago

SHA256 pretty much ensures that you have a unique hash for every value - and that's a feature you don't want for anonymization. So why not simply take the first few bytes of a SHA256, a small enough set to ensure that collisions not only might happen but will happen? I mean, that's a required feature to ensure anonymization, not just pseudonymization - if you can select a whole trail of events for ID #123 and be sure that these represent all the events for some (unknown) real user, then that by itself means that those events aren't anonymous, they're pseudonymous.

You can tweak the hash length so that whatever statistics you run out of the hashed data are meaningful (though not exact) despite the collisions, but that running a dictionary attack of plausible usernames returns an overwhelming amount of false positives.
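
A minimal sketch of the truncation idea, assuming a one-byte prefix and a made-up population of 10,000 usernames. The function name `bucket_id` and the sample data are hypothetical; choosing the right prefix length for a real dataset is the hard part and is not addressed here.

```python
import hashlib

def bucket_id(value: str, n_bytes: int = 1) -> str:
    """Return the first n_bytes of SHA-256(value) as hex.

    With n_bytes=1 there are only 256 possible identifiers, so many
    distinct users necessarily share each one.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[:n_bytes].hex()

# Hypothetical usernames, purely for illustration.
users = [f"user{i}" for i in range(10_000)]
buckets = {}
for u in users:
    buckets.setdefault(bucket_id(u), []).append(u)

# A dictionary attack on one identifier returns everyone who shares it.
candidates = buckets[bucket_id("user42")]
print(len(buckets), "identifiers;", len(candidates), "users share user42's")
```

Aggregate counts per identifier remain usable as rough statistics, while a single identifier no longer points at a single person.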

jacquesm about 7 years ago

The idea that data is a corporate asset has to die. Data is a corporate liability.

procrastinatus about 7 years ago

Differential privacy seems like a pretty good approach to this problem.

https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html

Area12 about 7 years ago

I am not a crypto expert, but I thought that the idea was to produce a new, more or less random salt for EACH password, store the salt with the hashed password, and hash using an expensive algorithm. Yes, the hacker steals the salt with the hash, but now has to go to the trouble of brute forcing that ONE password with its UNIQUE (or almost unique) salt. In other words, the hacker can crack it, but the process is so expensive for ONE password that cracking an entire database of passwords is a nightmare. Of course, the hacker just focuses on the most privileged accounts, I guess, but the idea is to make the hacker's life as unpleasant as possible, and to catch the hacker while they are coming back in. Am I missing the point? I do see that if the hacker wants one password, they can get it with effort even with unique salts.
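
The per-password salt scheme described here can be sketched as follows. This is an illustration only: it uses PBKDF2 from Python's standard library, and the iteration count is an assumed value, not a recommendation. The function names are hypothetical.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is drawn per password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(candidate, expected)
```

Because every record has its own salt, an attacker cannot reuse one guess across the whole table; each password has to be attacked separately at the full cost of the slow hash.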

kurthr about 7 years ago

I'm surprised that there was no mention of a salt used in a secure server to generate the hashes and act as an oracle. Adding pepper at the customer site already seemed like a good idea. Of course this is still hard and requires diligence for those who care about their customers and data security.

lolc about 7 years ago

The trouble is when we're holding on to the original data because we want the option to process it in new ways later on. The fundamental problem is that data correlates facts. Thus - as the article rightly points out - if you know some of the facts you can reconstruct identities.

I find the distinction between information and exformation revealing: Information is the bits we gleaned from the data, exformation is the bits we discarded while reducing the data. The efficacy of an information processing system is in how much it discards while extracting the information we need. The expensive operation is not the recording but the forgetting.

If you want to protect data from being stolen, distill it as soon as possible into the information you need. And destroy the rest. It comes down to the value of being able to re-run the analysis versus the effort to guard the data.

voidmain about 7 years ago

"Anonymization" in the sense of transforming a dataset so that it's still useful but doesn't significantly reduce the privacy of the people it describes, is usually impossible, or at least beyond the state of the art. People start out with just a few tens of bits of anonymity and bits are everywhere.

You probably have a better chance of creating your own secure block cipher than of achieving this goal. In a similar way, your inability to see what's wrong with your scheme is not evidence that it works.

I don't like to be negative, and I'm all for continued research, but at this point the conservative thing to do with data that you need to "anonymize" is delete it.

kahnjw about 7 years ago

The author makes a good point: anonymizing data is hard. Unfortunately they don't mention differential privacy, a promising area of research that can help us solve these problems.

https://en.wikipedia.org/wiki/Differential_privacy
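
As a rough illustration of what differential privacy looks like in practice, here is a sketch of the Laplace mechanism applied to a counting query. A count changes by at most 1 when one person is added or removed, so noise with scale 1/epsilon is used. The mileage figures and the choice of epsilon are invented for the example.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon: float, rng=None) -> float:
    """Return a noisy count of the values that satisfy predicate.

    Counting queries have sensitivity 1, hence the noise scale 1/epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical yearly mileages, purely for illustration.
miles = [1200, 7400, 300, 5600, 9100, 4800, 6200]
print(private_count(miles, lambda m: m > 5000, epsilon=1.0))
```

Each released answer is slightly wrong on purpose; smaller epsilon means more noise and stronger privacy, and repeated queries against the same data consume more of the privacy budget.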

djhworld about 7 years ago

Where I work we've been debating about this a lot. I work with log data from CDNs, so user IP addresses get ingested. We use that information and correlate it with geoip services to determine stuff like the ISP being used.

This is so we can evaluate CDN performance and also see how well ISPs are doing in serving content to the user. So it's essentially asking questions about network performance rather than at a macro level of individual users.

As far as IPs are concerned we don't care much after that, other than maybe the odd "how many unique IP addresses were served today" type queries.

We've talked about doing the secret/salt that is rotated periodically, but to be safe you would definitely need to ensure previous salts are destroyed, and not even let people view them or access them when they are live.
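
A sketch of the rotating-secret idea, assuming the key lives only in process memory and is replaced on a schedule that is not shown here. The class name `RotatingPseudonymizer` is hypothetical, and a real deployment would need a proper key store and a way to guarantee old keys are actually gone.

```python
import hashlib
import hmac
import secrets

class RotatingPseudonymizer:
    """Keyed hashing of IP addresses with a key that is never exported."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)

    def rotate(self) -> None:
        # Overwrite the old key; pseudonyms from earlier periods can no
        # longer be recomputed or linked to new ones.
        self._key = secrets.token_bytes(32)

    def pseudonym(self, ip: str) -> str:
        return hmac.new(self._key, ip.encode("ascii"), hashlib.sha256).hexdigest()

p = RotatingPseudonymizer()
todays_ips = ["203.0.113.7", "203.0.113.8", "203.0.113.7"]
print(len({p.pseudonym(ip) for ip in todays_ips}), "unique addresses today")
```

Within one key period the "how many unique IP addresses were served today" query still works, since equal addresses map to equal pseudonyms.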

javajosh about 7 years ago

When addressing the solution of adding data (salt) I find the author's counter-argument unconvincing:

<pre><code>
Don’t get me wrong, this does make it significantly harder to attack a leaked database to unmask every user, but the resources required to do so or target specific users are within the reach of many adversaries.
</code></pre>

I don't see how it's more feasible to reverse hash(known_user+salt) than it is to dereference hash(salt), and even state level actors can't do anything but attempt to brute-force hash(salt). IOW, without more behind the author's assertion, I don't buy that adding more data to the data you want to protect is insufficient protection, even against known targets.

lbriner about 7 years ago

The link to Cryptographic Right Answers is really helpful and the kind of article that it would be nice to make the general "go-to" for those of us who know enough but not enough to do it ourselves!

What I didn't like was the continual reference to AWS as if it is the only provider available, without qualifying whether it is specifically an AWS product that solves the problem or whether it is an example of using a cloud service to transfer the risk. There are many alternatives to AWS load balancers and Key Management systems, so the advice is tainted (sigh).

Terr_ about 7 years ago

Why not a two-step process, where you (A) generate a hash from fixed user details and (B) use that hash to access a lookup-table for the final UUID? This combines some strengths of both systems:

1. Outsiders can't determine an arbitrary UUID, even if they know the original user-details.
2. You can easily destroy a relationship (to limit correlation or to comply with laws like GDPR) by erasing the corresponding row in the lookup table.
3. Insiders can't directly go backwards from UUID to real-name, due to the hashing step. They would need to generate hashes for all the users, and hope that matches still exist in the lookup table.
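
The two-step scheme above can be sketched as follows, using an in-memory dict as a stand-in for the lookup table. The class and method names are hypothetical, and the plain SHA-256 in step (A) is only what the comment describes; as point 3 notes, it remains guessable by anyone holding the table.

```python
import hashlib
import uuid

class PseudonymTable:
    """Step A: hash the user details. Step B: map the hash to a random UUID."""

    def __init__(self) -> None:
        self._table: dict[str, uuid.UUID] = {}

    @staticmethod
    def _key(user_details: str) -> str:
        return hashlib.sha256(user_details.encode("utf-8")).hexdigest()

    def uuid_for(self, user_details: str) -> uuid.UUID:
        # The UUID is random, so it cannot be derived from the details alone.
        return self._table.setdefault(self._key(user_details), uuid.uuid4())

    def forget(self, user_details: str) -> bool:
        # Erasing the row destroys the link between the person and the UUID.
        return self._table.pop(self._key(user_details), None) is not None
```

After `forget`, events already stored under the old UUID stay in the dataset but can no longer be tied to the person through the table, and a later `uuid_for` call yields a fresh, unrelated UUID.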

JTbane about 7 years ago

Surprising that no mention is made of rainbow tables or lookup tables. If you hash something that can easily be looked up in a table, it's obviously not anonymous.

Passwords are stored as salted hashes for these obvious reasons...

jiveturkey about 7 years ago

Really good article. One of those things that are beyond obvious to those of us close to this field, but not at all obvious to the general software dev (who also mightn't know the difference between a good cryptographic hash and a good password hash).

What would make this article great is general ideas on what is a good way to anonymize data. I'm surprised that info is missing, actually.

What would make it world class great is discussion about GDPR ramifications, keeping in mind that one need not necessarily be perfect for GDPR, even if you're FB/Google.

AndrewSChapman about 7 years ago

If our goal is true anonymisation, that is, even the host cannot know who the data belongs to, why are we hashing data at all, and not completely removing it? Replace the PII (name, email address, phone etc.) with a fixed number of *'s. There's no reversing or guessing that.

If we want information to be readable by some people in some circumstances, that's not anonymisation: that's data protection and an entirely different problem.
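
A minimal sketch of the redaction approach, assuming records are plain dicts. The field names in `PII_FIELDS` and the mask length are made up for the example.

```python
REDACTED = "*" * 8  # fixed length, so the mask does not leak the original size
PII_FIELDS = {"name", "email", "phone"}  # hypothetical field names

def redact(record: dict) -> dict:
    """Return a copy of record with every PII field replaced by the mask."""
    return {k: (REDACTED if k in PII_FIELDS else v) for k, v in record.items()}

print(redact({"name": "Alice Example", "email": "alice@example.com", "miles": 7400}))
```

Nothing derived from the original value survives, so there is nothing to brute-force; the cost is that records from the same person can no longer be linked to each other.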

zAy0LfpBZLC8mAC about 7 years ago

I think another problem is that we even call any of that "anonymization". If you replace "foobar" with "1", you haven't anonymized anything. At best, you have pseudonymized your data. Whether you use hashing or a secret mapping function, as long as identity within your dataset is preserved, what you are generating are pseudonyms.

zzzcpan about 7 years ago

> The way we’ve chosen to anonymize the data is by generating HMAC

You can also truncate the hash after the HMAC to mix the data of different users. It still would be useful for aggregate analytics, abuse protection, rate limiting, etc, but if each user shares an identifier with many others it would be harder to unmask them and make correlations.
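
A sketch of truncating after the HMAC, assuming an 8-bit identifier and a placeholder key. The key literal, the function name `shared_id` and the event list are all hypothetical; a real key would come from a secrets store.

```python
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical placeholder

def shared_id(user: str, bits: int = 8) -> int:
    """Keep only the top `bits` bits (1 to 32) of HMAC-SHA256(key, user)."""
    mac = hmac.new(SECRET_KEY, user.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(mac[:4], "big") >> (32 - bits)

# Hypothetical request log, purely for illustration.
events = ["alice", "bob", "alice", "carol", "alice"]
requests_per_id = Counter(shared_id(u) for u in events)
print(requests_per_id.most_common(1))
```

Rate limiting on `shared_id` still catches a heavy hitter, at the price of occasionally throttling the unrelated users who share that identifier.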

mlinksva about 7 years ago

Another recent writeup: https://freedom-to-tinker.com/2018/04/09/four-cents-to-deanonymize-companies-reverse-hashed-email-addresses/

angry_octet about 7 years ago

See also, why hashing is not a good way to discover shared contacts, and a better way: https://signal.org/blog/contact-discovery/

tempodox about 7 years ago

Any thoughts on using a UUID instead of a username hash?

blattimwind about 7 years ago

tl;dr preimage resistance is only as strong as sizeof(input domain), which is probably small if you're trying to anonymize something.
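
To make the point concrete, here is a sketch that reverses an unsalted SHA-256 of an IPv4 address by enumerating a /16 network. The assumption that the attacker knows the first two octets, and the addresses themselves, are invented for the example.

```python
import hashlib
from typing import Optional

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("ascii")).hexdigest()

def recover_ip(target_hash: str, prefix: str = "192.168") -> Optional[str]:
    """Try all 65,536 addresses under prefix.x.y until one hashes to the target."""
    for third in range(256):
        for fourth in range(256):
            candidate = f"{prefix}.{third}.{fourth}"
            if sha256_hex(candidate) == target_hash:
                return candidate
    return None

leaked = sha256_hex("192.168.17.42")  # what a "hashed" log entry would contain
print(recover_ip(leaked))
```

The hash function is not broken here; the search succeeds simply because the set of possible inputs is tiny compared with what the hash could protect.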

slooonz about 7 years ago

If you don’t require deterministic hashes (and deterministic hashes are bad for anonymization anyway) just hash data+randomBytes(16) (obviously, don't save randomBytes(16) anywhere). There you are, nobody can bruteforce your hashes.

Even better, just replace your data with H(randomBytes(16)). Or a random UUID.
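
Both variants can be sketched in a few lines, assuming Python's `secrets` module as the source of random bytes. The function names are hypothetical.

```python
import hashlib
import secrets
import uuid

def unlinkable_token(data: str) -> str:
    """Hash data together with 16 random bytes that are never stored."""
    nonce = secrets.token_bytes(16)  # discarded when the function returns
    return hashlib.sha256(data.encode("utf-8") + nonce).hexdigest()

def random_id() -> str:
    """Simpler alternative: an identifier with no dependence on the data."""
    return str(uuid.uuid4())

print(unlinkable_token("alice@example.com"))
print(unlinkable_token("alice@example.com"))  # differs from the line above
```

The same input yields a different token every time, which is exactly why it resists brute force, and also why it cannot be used to join records that belong to the same person.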
