TechEcho

10 comments

motohagiography almost 7 years ago

De-identification of data sets (like cryptography) is a very difficult problem.

It is great that people are building tools for this. Even if I were skeptical of one or another in particular, the availability of tools popularizes the discussion of what is necessary and sufficient for de-identifying data.

The main use case I worked on was how to test an event-driven (SOA at the time) pipeline without production data. Health information handling is very tightly regulated, so generating a test data set that was large enough and reflected the needs of the system was a significant challenge. Engineers couldn't just copy some production data and use it for testing. The regime I worked in that defined these rules (early PHIPA, PIPEDA in Ontario) is not unlike what people may encounter with GDPR.

When I was doing this sort of work, I found that it made more sense to find the structure of the data, then synthesize it from scratch. For a data format like HL7, this is non-trivial.

Synthesizing a few gigabytes of JSON/XML/text from a small training corpus provides incomplete test data. There are a few companies in the de-identification business, and I remember a few consulting services for it.

I can think of a few ways to do this, and they aren't simple.
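A minimal sketch of the "find the structure, then synthesize from scratch" idea, assuming a hypothetical flat patient-visit record rather than real HL7. The field names, the value pools, and the `synth_record` and `synth_dataset` helpers are illustrative assumptions, not taken from any real system.

```python
import random
from datetime import date, timedelta

# Hypothetical value pool; nothing here comes from real records.
WARDS = ["cardiology", "oncology", "emergency"]

def random_date(rng, start=date(2000, 1, 1), end=date(2020, 12, 31)):
    """Pick a uniformly random date in [start, end]."""
    span = (end - start).days
    return start + timedelta(days=rng.randint(0, span))

# Hypothetical schema: each field maps to a generator that takes an RNG.
SCHEMA = {
    "patient_id": lambda rng: f"P{rng.randrange(10**8):08d}",
    "admitted": lambda rng: random_date(rng).isoformat(),
    "ward": lambda rng: rng.choice(WARDS),
    "age": lambda rng: rng.randint(0, 100),
}

def synth_record(rng):
    """Build one synthetic record by calling every field generator."""
    return {field: gen(rng) for field, gen in SCHEMA.items()}

def synth_dataset(n, seed=0):
    """Generate n records; a fixed seed keeps test runs reproducible."""
    rng = random.Random(seed)
    return [synth_record(rng) for _ in range(n)]

records = synth_dataset(1000)
```

Because no production row is ever read, nothing sensitive can leak into the test set. The hard part, as noted above, is making the schema rich enough to exercise the system.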

Cynddl almost 7 years ago

How does this tool compare to other (libre) anonymization programs, such as ARX [1]? From what I understand, only basic routines to sample records and coarsen a few attributes (e.g. ZIP code, dates) are implemented so far.

This might also not be sufficient to truly anonymize data, as a large body of research has shown [2,3,4].

[1] https://arx.deidentifier.org
[2] https://www.uclalawreview.org/pdf/57-6-3.pdf
[3] http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf
[4] http://arxiv.org/abs/1712.05627

JackCh almost 7 years ago

The intent behind this tool seems good, but I don't think it's a good idea. To actually anonymize data requires semantic understanding of that data and an understanding of what sort of data, harmless by itself, is transmuted into identifying data when provided in the context of other otherwise harmless data.

This tool doesn't help you with any of that. It seems to be a glorified awk script. My concern is that helping the user with the easiest part of anonymizing data stands to encourage the user to go full steam ahead without slowing down to stop and think very carefully about what they're doing.
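To illustrate how individually harmless columns can become identifying in combination, here is a small sketch that measures what fraction of rows is unique under a given set of columns. The rows are made up, and `unique_fraction` is a hypothetical helper written for this example.

```python
from collections import Counter

# Made-up rows: (outcode, birth_year, sex). No real data.
ROWS = [
    ("W1W", 1980, "F"),
    ("W1W", 1980, "M"),
    ("W1W", 1975, "F"),
    ("SE1", 1980, "F"),
    ("SE1", 1980, "F"),
    ("SE1", 1962, "M"),
]

def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` match no other row."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)

print(unique_fraction(ROWS, [0]))        # outcode alone
print(unique_fraction(ROWS, [0, 1, 2]))  # all three columns combined
```

In this toy table no row is unique by outcode alone, but four of the six rows become unique once outcode, birth year, and sex are combined.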

pdkl95 almost 7 years ago

> anonymising ... columns until the output is useful for applications where sensitive information cannot be exposed

This tool will not provide any significant amount of anonymity.

> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value

This is not random. It deterministically selects the same very predictable fraction of rows.

> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)

> Given a date, just keep the year

Partial postal codes and dates quantized to the year are still very revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.

> Hash (SHA1) the input

Hashing does not provide anonymity. Substituting a candidate key with the hash of the key is usually a 1-to-1 map that is often trivial to reverse. It isn't hard to iterate through e.g. all possible names, postal codes, license plates, or other short-ish strings to find a matching SHA1.

https://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/

The salt might provide some resistance to pre-computed tables, but a GeForce GTX 1080 Ti running hashcat can search for matching SHA1 at over 11 GH/s (giga-hashes per second). That means that a single 1080 Ti running for ~3-4 hours would not only discover that SHA1("hasselhof") == ffe3294fad149c2dd3579cb864a1aebb2201f38d; it would exhaustively search all lowercase strings of 10 characters or fewer.

> range

This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult than it first appears.
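A rough sketch of the dictionary attack described above, followed by the keyspace arithmetic behind the "~3-4 hours" estimate. The candidate list and the `reverse_sha1` helper are hypothetical; the 11 GH/s rate is the figure quoted in the comment, not something measured here. The sketch covers the unsalted case only.

```python
import hashlib

def sha1_hex(text):
    """Hex SHA1 digest of a UTF-8 string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def reverse_sha1(target_hex, candidates):
    """Return the first candidate whose SHA1 matches, else None."""
    for candidate in candidates:
        if sha1_hex(candidate) == target_hex:
            return candidate
    return None

# Hypothetical candidate list; a real attack would enumerate names,
# postcodes, licence plates, and similar short strings.
candidates = ["alice", "bob", "W1W 8BE", "carol"]
leaked = sha1_hex("W1W 8BE")  # stands in for a "hashed" column value
print(reverse_sha1(leaked, candidates))

# Keyspace: every lowercase string of length 1 through 10.
keyspace = sum(26 ** n for n in range(1, 11))
rate = 11e9  # hashes per second, as quoted in the comment above
hours = keyspace / rate / 3600
print(f"{keyspace:.3e} candidates, roughly {hours:.1f} hours at 11 GH/s")
```

The loop is deliberately naive; the point is that the search space for short identifiers is small enough that even exhaustive enumeration is cheap on commodity hardware.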

magissima almost 7 years ago

JSON for a config that's intended to be used by humans is an abomination.

simlevesque almost 7 years ago

Good idea. You should add a preview of before and after the anonymisation.

stepik777 almost 7 years ago

Why is it a UNIX tool? What makes it UNIX? Would it not work on e.g. Windows?

unhammer almost 7 years ago

Slightly related: Metadata Anonymisation Toolkit https://mat.boum.org/ (which seems to be in need of contributors)

Tepix almost 7 years ago

I'm surprised it doesn't support anonymisation of IP addresses. That would be pretty much the first feature I'd implement.
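One common way to coarsen IP addresses is to zero the host bits and keep only a network prefix. The sketch below assumes /24 for IPv4 and /48 for IPv6; those prefix lengths and the `truncate_ip` helper are assumptions made for illustration, not a feature of the tool under discussion. As with the outcode and year examples elsewhere in the thread, truncation alone does not guarantee anonymity.

```python
import ipaddress

def truncate_ip(addr, v4_prefix=24, v6_prefix=48):
    """Zero the host bits, keeping only the network prefix."""
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets ip_network accept an address with host bits set.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

print(truncate_ip("203.0.113.77"))           # documentation-range IPv4
print(truncate_ip("2001:db8:1234:5678::1"))  # documentation-range IPv6
```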

qop almost 7 years ago

Now that homomorphic encryption exists, why is data anonymization still a desired thing?

Show HN: Anon – A Unix Command to Anonymise Data
