As a reminder of how the obscurity of data can keep it from being used for vital analysis -- and, as a corollary, how the statistical analysis doesn't even have to be that sophisticated once you get your hands on obscured data -- I frequently point my students to this writeup of how reporters tracked misbehaving police officers in Florida (a state in which it is, comparatively speaking, very easy to get this data):

http://ire.org/blog/on-the-road/2011/12/20/behind-story-tracking-police/

> *It is the cleanest set of data I have ever worked with. There was no big clean up with the data. Sometimes you get a data set and find out it has errors or wrong information. Everywhere we turned this data pointed us correctly... When we got the data I spent a week or more playing around with it -- sorts and counts, which officers got written up the most number of times. Then I started looking at certain types of offenses, like "he had both a domestic violence and an excessive force." From that I created a list of 150-200 officers. Once we had a nice healthy pool of targets we tried to find out more details on them by asking for the reports on the incidents.*

> *This was a case where the government had this wonderful, informative dataset and they weren't using it at all except to compile the information. I remember talking to one person at an office and saying: "How could you guys not know some of this? In five minutes of (SQL) queries you know everything about these officers?" They basically said it wasn't their job. That left a huge opportunity for us.*

The 538 article repeats this sentiment:

> *The city had the tools to identify and curtail troublesome officers before Kalven pursued legal action. All of the complaints were stored within the department long before the Invisible Institute had access. As Kalven put it, "All the knowledge to transform the system existed within the system."*

edit: The 538 article is greatly appreciated... I didn't even know of the Citizens Police Data Project, and now I do... but skimming the dataset, I would take a different approach than the one implied by the 538 article's title. A good data investigation doesn't have to come up with a statistical model for predicting bad behavior. A simple group-by-count is devastating enough. A basic aggregation of the tabular data shows that the majority of police officers are doing "fine" -- but that's not the point. The point is that when police officers are *egregiously* and *repeatedly* bad, there is apparently no institutional mechanism to root that out. And we have little reason to expect that to change if, God forbid, more and more police officers decide to go bad.

In other words, we can argue that Chicago cops are generally good, even great. But the fact that we don't have as many cop problems as we hypothetically *could* is a matter of luck and faith in human behavior within a bureaucracy... Faith in the human spirit is a nice feel-good thing -- like believing the next shuttle will launch safely even though the Challenger just blew up -- but when it comes to public safety and justice, it is imperative to demand more. If the police loathe releasing disciplinary data (never mind that it's the law, of course), they should consider being a bit more proactive about policing themselves and removing the low-hanging bad apples.
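To be concrete about how little analysis is needed: the "group-by-count" the IRE reporter describes really is a few minutes of work. Here's a minimal sketch in Python/pandas, assuming a hypothetical complaints.csv with officer_id and offense_type columns (the file and column names are illustrative, not the actual CPDP or Florida schema):

```python
# Minimal sketch of the "simple group-by-count" described above.
# Assumes a hypothetical complaints.csv with columns: officer_id, offense_type.
# (File and column names are illustrative, not the real dataset's schema.)
import pandas as pd

complaints = pd.read_csv("complaints.csv")

# Count complaints per officer and sort the most-complained-about to the top.
counts = (
    complaints.groupby("officer_id")
    .size()
    .sort_values(ascending=False)
    .rename("complaint_count")
)
print(counts.head(20))

# Flag officers who have *both* a domestic-violence and an excessive-force
# complaint, mirroring the reporter's example query.
flagged = (
    complaints.groupby("officer_id")["offense_type"]
    .apply(lambda s: {"domestic violence", "excessive force"} <= set(s.str.lower()))
)
print(flagged[flagged].index.tolist())
```

That's the whole pitch: nothing fancier than a sort, a count, and a filter is needed to surface the repeat offenders, which is exactly what makes the institutional inaction so damning.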