I've worked extensively with this dataset on a similar project, <a href="http://crashmapper.org" rel="nofollow">http://crashmapper.org</a>, and through that process found that the data is extremely error-prone. Perhaps 20% of the recorded collisions are not geocoded (i.e., they lack lat/long coordinates) and don't contain other location information, such as street, cross street, or zip code, that could be used to geocode them. It appears that some NYPD precincts do a better job of recording crash locations than others. Even more of the data lacks values for "contributing factors," so that field seems difficult to use as a metric for analysis. Often there is a mismatch between the total number of persons injured or killed and the sum of pedestrians, cyclists, and motorists injured or killed. Furthermore, whoever maintains this dataset will periodically go back and update it, seemingly at random, editing existing records or adding new ones months or even years after the fact. Often the maintainer appears to be changing values for fields such as the number of pedestrians, cyclists, or motorists injured or killed. Presumably this is because more information about an incident surfaced later and the city had to go back and update it. However, this can result in stats derived from the data not aligning with the NYPD's or DOT's official stats for a previous year. I would advise anyone to keep these facts in mind when using the data for analysis and policy recommendations. Such is open data.
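
For anyone who wants to measure these problems before trusting the numbers, here is a rough sketch of the kind of audit I mean. It is not the code behind crashmapper.org. The field names (`latitude`, `street`, `persons_injured`, and so on) are simplified placeholders I made up for illustration, so map them to whatever headers your export actually uses. The three sample rows are invented as well.

```python
# Hypothetical, simplified field names; the real export's headers will differ.
LOCATION_FIELDS = ("street", "cross_street", "zip_code")
ROAD_USER_FIELDS = ("pedestrians_injured", "cyclists_injured", "motorists_injured")


def is_blank(value):
    """True for None, empty strings, and whitespace-only strings."""
    return value is None or str(value).strip() == ""


def has_coordinates(row):
    return not is_blank(row.get("latitude")) and not is_blank(row.get("longitude"))


def is_unlocatable(row):
    """No coordinates AND no street, cross street, or zip code to geocode from."""
    return not has_coordinates(row) and all(
        is_blank(row.get(field)) for field in LOCATION_FIELDS
    )


def to_int(value):
    # Assumption: a blank count is treated as 0 rather than as "unknown".
    return 0 if is_blank(value) else int(value)


def injury_totals_mismatch(row):
    """True when the persons-injured total differs from the per-road-user sum."""
    parts = sum(to_int(row.get(field)) for field in ROAD_USER_FIELDS)
    return to_int(row.get("persons_injured")) != parts


def audit(rows):
    """Count how many rows show each of the quality problems checked above."""
    rows = list(rows)
    return {
        "total": len(rows),
        "unlocatable": sum(is_unlocatable(r) for r in rows),
        "missing_factor": sum(is_blank(r.get("contributing_factor")) for r in rows),
        "injury_mismatch": sum(injury_totals_mismatch(r) for r in rows),
    }


# Invented sample rows, one per situation described in the comment.
sample = [
    {  # geocoded, factor present, totals consistent
        "latitude": "40.7", "longitude": "-73.9",
        "street": "BROADWAY", "cross_street": "CANAL STREET", "zip_code": "10013",
        "contributing_factor": "Driver Inattention/Distraction",
        "persons_injured": "1", "pedestrians_injured": "1",
        "cyclists_injured": "0", "motorists_injured": "0",
    },
    {  # no coordinates, but a street and zip code remain for geocoding
        "latitude": "", "longitude": "",
        "street": "ATLANTIC AVENUE", "cross_street": "", "zip_code": "11217",
        "contributing_factor": "",
        "persons_injured": "2", "pedestrians_injured": "0",
        "cyclists_injured": "0", "motorists_injured": "2",
    },
    {  # no location information at all, and the totals do not add up
        "latitude": "", "longitude": "",
        "street": "", "cross_street": "", "zip_code": "",
        "contributing_factor": "",
        "persons_injured": "3", "pedestrians_injured": "1",
        "cyclists_injured": "0", "motorists_injured": "1",
    },
]

report = audit(sample)
print(report)
```

To run it on a real export you could feed `audit` the rows from `csv.DictReader` after renaming the columns. The same mismatch check can be repeated for the "killed" fields. Catching the retroactive edits is a separate job: you would have to keep dated snapshots of the export and diff them by record ID.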