Show HN: Anon – A Unix Command to Anonymise Data

79 points by xomateix almost 7 years ago

10 comments

motohagiography almost 7 years ago
De-identification of data sets (like cryptography) is a very difficult problem.

It is great that people are building tools for this. Even if I were skeptical of one or another in particular, the availability of tools popularizes the discussion of what is necessary and sufficient for de-identifying data.

The main use case I worked on was how to test an event-driven (SOA at the time) pipeline without production data. Health information handling is very tightly regulated, so generating a test data set that was large enough and reflected the needs of the system was a significant challenge. Engineers couldn't just copy some production data and use it for testing. The regime I worked in that defined these rules (early PHIPA, PIPEDA in Ontario) is not unlike what people may encounter with GDPR.

When I was doing this sort of work, I found that it made more sense to find the structure of the data, then synthesize it from scratch. For a data format like HL7, this is non-trivial.

Synthesizing a few gigabytes of JSON/XML/text from a small training corpus provides incomplete test data. There are a few companies in the de-identification business, and I remember a few consulting services for it.

I can think of a few ways to do this, and they aren't simple.
Cynddl almost 7 years ago
How does this tool compare to other (libre) anonymization software, such as ARX [1]? From what I understand, only basic routines to sample records and coarsen a few attributes (e.g. ZIP code, dates) are implemented so far.

This might also not be sufficient to truly anonymize data, as a large body of research has shown [2,3,4].

[1] https://arx.deidentifier.org

[2] https://www.uclalawreview.org/pdf/57-6-3.pdf

[3] http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf

[4] http://arxiv.org/abs/1712.05627
JackCh almost 7 years ago
The intent behind this tool seems good, but I don't think it's a good idea. To actually anonymize data requires semantic understanding of that data and an understanding of what sort of data, harmless by itself, is transmuted into identifying data when provided in the context of other otherwise harmless data.

This tool doesn't help you with any of that. It seems to be a glorified awk script. My concern is that helping the user with the *easiest* part of anonymizing data stands to encourage the user to go full steam ahead without slowing down to stop and think very carefully about what they're doing.
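As a rough illustration of the point about individually harmless columns, the sketch below counts how many rows share each combination of values. The column names and rows are invented for the example and are not taken from the tool; the group-size check is a simplified k-anonymity-style measure, not something the tool is known to provide.

```python
from collections import Counter


def smallest_group(rows, columns):
    """Return the size of the smallest group of rows sharing the same
    values in `columns`. A result of 1 means at least one row is unique."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(counts.values())


# Hypothetical, already "coarsened" records (outcode only, year only).
rows = [
    {"outcode": "W1W", "year": 1980, "sex": "F"},
    {"outcode": "W1W", "year": 1980, "sex": "M"},
    {"outcode": "W1W", "year": 1975, "sex": "F"},
    {"outcode": "E2", "year": 1975, "sex": "M"},
    {"outcode": "E2", "year": 1980, "sex": "M"},
    {"outcode": "E2", "year": 1975, "sex": "F"},
]

# Each column on its own hides a person among several others...
per_column = {c: smallest_group(rows, [c]) for c in ("outcode", "year", "sex")}

# ...but the combination of all three singles out every row.
combined = smallest_group(rows, ["outcode", "year", "sex"])

print(per_column, combined)
```

In this toy data every single column leaves each person in a group of three, while the three columns together leave every person in a group of one.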
pdkl95 almost 7 years ago
> anonymising ... columns until the output is useful for applications where sensitive information cannot be exposed

This tool will not provide any significant amount of anonymity.

> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value

This is not random. It deterministically selects the same very predictable fraction of rows.

> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)

> Given a date, just keep the year

Partial postal codes and dates quantized to the year are still *very* revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.

> Hash (SHA1) the input

*Hashing does not provide anonymity.* Substituting a candidate key with the hash of the key is usually a 1-to-1 map that is often trivial to reverse. It isn't hard to iterate through e.g. all possible names, postal codes, license plates, or other short-ish strings to find a matching SHA1.

https://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/

The salt *might* provide some resistance to pre-computed tables, but a GeForce GTX 1080 Ti running hashcat can search for matching SHA1 at over 11 GH/s (giga-hashes per second). That means that a single 1080 Ti running for ~3-4 hours would discover *not only* that SHA1("hasselhof") == ffe3294fad149c2dd3579cb864a1aebb2201f38d; it would also exhaustively search all lowercase strings of 10 characters or fewer.

> range

This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult than it first appears.
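The hashing argument can be sketched in a few lines. This is a minimal illustration, not the tool's code: the candidate list, the `salt + value` concatenation, and the assumption that the attacker knows the salt are all hypothetical. The last function only redoes the arithmetic behind the "~3-4 hours" estimate, using the 11 GH/s figure quoted in the comment.

```python
import hashlib


def sha1_hex(value: str, salt: str = "") -> str:
    """SHA1 of salt + value, as a hex string (salting scheme is assumed)."""
    return hashlib.sha1((salt + value).encode("utf-8")).hexdigest()


def reverse_by_dictionary(target_hash, candidates, salt=""):
    """Try every candidate and return the one whose hash matches, if any."""
    for candidate in candidates:
        if sha1_hex(candidate, salt) == target_hash:
            return candidate
    return None


def exhaustive_search_hours(alphabet_size, max_length, hashes_per_second):
    """Hours needed to hash every string of length 1..max_length."""
    total = sum(alphabet_size ** k for k in range(1, max_length + 1))
    return total / hashes_per_second / 3600


# A "hashed" surname column is only as private as the list of surnames is long.
candidate_names = ["smith", "jones", "hasselhof", "garcia"]
published = sha1_hex("hasselhof")
recovered = reverse_by_dictionary(published, candidate_names)

# A salt that the attacker also knows (assumed here) does not change the outcome.
salted = sha1_hex("jones", salt="s3")
recovered_salted = reverse_by_dictionary(salted, candidate_names, salt="s3")

# All lowercase strings of up to 10 characters at 11 giga-hashes per second.
hours = exhaustive_search_hours(26, 10, 11e9)
print(recovered, recovered_salted, round(hours, 1))
```

The dictionary here has four entries; a real attack would use a list of names, postcodes, or plates, which is still tiny compared with what a GPU can hash per second.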
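pdkl95's point about sampling can also be shown directly. The sketch below assumes a hash-then-mod rule of the kind quoted in that comment; `zlib.crc32` stands in for whatever 32-bit hash the tool actually uses, and the keys are made up. It shows that the "sample" is a pure function of the key, so it is identical on every run and anyone who knows the rule can predict which rows survive.

```python
import zlib


def keep_row(key: str, mod: int) -> bool:
    """Keep a row when the 32-bit hash of its key is divisible by `mod`.
    crc32 is a stand-in for the tool's hash, chosen only for illustration."""
    return zlib.crc32(key.encode("utf-8")) % mod == 0


def sample(keys, mod):
    """Return the keys selected by the hash-mod rule."""
    return [k for k in keys if keep_row(k, mod)]


# Hypothetical key column.
keys = [f"user{i}@example.com" for i in range(1000)]

first_run = sample(keys, 10)
second_run = sample(keys, 10)

# No randomness is involved: both runs select exactly the same rows.
print(first_run == second_run, len(first_run))
```

Determinism can be a feature (the same user is consistently in or out across files), but it should not be mistaken for random sampling.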
magissima almost 7 years ago
JSON for a config that's intended to be used by humans is an abomination.
simlevesque almost 7 years ago
Good idea. You should add a preview of before and after the anonymisation.
stepik777 almost 7 years ago
Why is it a UNIX tool? What makes it UNIX? Would it not work on e.g. Windows?
unhammer almost 7 years ago
Slightly related: Metadata Anonymisation Toolkit https://mat.boum.org/ (which seems to be in need of contributors)
Tepix almost 7 years ago
I'm surprised it doesn't support anonymisation of IP addresses. That would be pretty much the first feature I'd implement.
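One common way to coarsen IP addresses is to zero the host bits, which is sketched below with Python's standard `ipaddress` module. This is not a feature of the tool, and the /24 and /48 prefix lengths are assumptions picked for the example. As with outcodes and years elsewhere in this thread, a truncated address can still be identifying when combined with other columns.

```python
import ipaddress


def truncate_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an IPv4 or IPv6 address.
    The default prefix lengths are illustrative, not a recommendation."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets ip_network accept an address with host bits set.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


# Addresses from the documentation ranges (RFC 5737 and RFC 3849).
v4 = truncate_ip("192.0.2.123")
v6 = truncate_ip("2001:db8:abcd:12:3456::1")
print(v4, v6)
```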
qop almost 7 years ago
Now that homomorphic encryption exists, why is data anonymization still a desired thing?