Very smart inquiry. I was somewhat of a skeptic initially that the potential danger to privacy outweighed the value of making the data as transparent as possible, but that was just a guess based on my preconceived notions of taxi use, which were already inadequate: it still blows my mind how many taxi trips there are on an average day.<p>I think an argument can still be made that, even if the OP is right about the number of trips that can be uniquely identified, the value of keeping the coordinate data still outweighs the real-life privacy risk. That is, the small number of people who would hire a private investigator or specialist to analyze this data to catch a specific person would find it much faster to track that person the way PIs normally do. But the rebuttal can't simply be, "uniquely identifiable trips are probably so rare as to be inconsequential."
"it turns out that if you know the census tracts for pickups and drop offs, plus pickup times truncated to the nearest hour, then you can uniquely identify 40% of NYC taxi trips"<p>Hmmm... but if you already have those pieces of information (start tract, end tract, start hour) what would you want to get from the data? How much someone paid? How much they tipped? Whether they paid with cash or card?<p>Can anyone see an obvious nefarious use for this data?
I find this kind of analysis to be really awesome and I'd love to learn how to do even a more basic version of it. Does anyone have some resources they can point me to?<p>I'm actually working on a small project that has a much, much smaller dataset than the NYC Taxi data but some similar attributes (geographic coordinates mainly). I'd love to produce something like this with what I find (assuming I can find anything interesting).
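> uniquely identified by birthday, gender, and ZIP code<p>This is not correct as stated; you need to say "full birthday," meaning the date including the year. Otherwise the statement is nonsense: a month and day alone give only 366 possible values, or 732 combinations with two genders, which is fewer than the number of residents in many ZIP codes.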