I've recently been doing a bunch of stuff with sports stats, which involves lots of GIS data. This sort of thing comes up a lot - trying to find a player's 'territory' based on coordinates of their actions in a game, without including outlying events that cause you to overestimate the area.<p>There's a concept in animal behaviour called a 'home range' which is more or less the same thing - GPS attached to tigers in the wild etc. Some of the algorithms there are quite interesting, from simply drawing a bounding box around the data points, to working out the probability density, to things like LoCoH, which sort of recursively builds up convex hulls from nearest neighbours.<p>All of these things are pretty much possible in SQL to one degree of performance or another. But ultimately I'm fascinated by things like SQL Server's R support - you can get far simpler, more natural implementations of these things in R (or indeed in custom aggregates or functions in other languages). I think in the long term, database engines that offer this sort of extensibility are going to thrive for analytics work, be they SQL based or otherwise.
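To make the contrast concrete, here's a rough pure-Python sketch (no GIS libraries, made-up point data) of the two simplest 'home range' estimators mentioned above - a bounding box versus a convex hull. This is just an illustration of the idea, not anyone's production code:

```python
def bounding_box_area(points):
    """Area of the axis-aligned bounding box around (x, y) points --
    the crudest home-range estimate, very sensitive to outliers."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def convex_hull(points):
    """Andrew's monotone chain algorithm: returns the hull vertices
    in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    s = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0
```

A single far-away event inflates the bounding box much faster than the hull, which is exactly why the fancier methods (density estimates, LoCoH) go further and drop outlying points entirely.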
Is it just me, or is Periscope allergic to saying which databases it supports? I can find nothing on the website, and I've been really hesitant to sign up for more info ever since companies started actually calling my phone after I did that.<p>Also, none of those queries will work on MySQL.
I would like to know about more sophisticated techniques for outlier detection. These are stat 101 level. Z scores? You can get into a lot of trouble assuming a normal distribution.<p>What are credit card companies doing? What's the best way to combine multiple variables that are predictive of an event for outlier detection? Is there a simple framework to automate reporting of these events in real-time?<p>One way to model this might be to treat the outcome ("Is this event an outlier?") as a 0/1 variable and use one of the many ways to model that type of data–random forests, logistic regression, neural networks, etc. The problem is that this isn't really "outlier" detection anymore.
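One fairly standard step up from plain z-scores - a sketch of the general idea, not a complete answer to the question - is to replace the mean and standard deviation with the median and the median absolute deviation (MAD), which are far less distorted by the very outliers you're trying to find. The 3.5 cutoff below is the commonly cited Iglewicz-Hoaglin threshold for modified z-scores:

```python
import statistics


def robust_z_scores(values, consistency=1.4826):
    """Modified z-scores using median and MAD instead of mean/stddev.
    The 1.4826 factor makes MAD consistent with the standard deviation
    when the data really is normal."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; data too concentrated for this method")
    return [(v - med) / (consistency * mad) for v in values]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    return [v for v, z in zip(values, robust_z_scores(values))
            if abs(z) > threshold]
```

It still assumes roughly symmetric, unimodal data, so it doesn't answer the multivariate or real-time parts of the question - but it survives contaminated samples that wreck an ordinary z-score.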
These are some cool SQL tricks! I like it.<p>The big caveat with the standard deviation technique is that it assumes a normal distribution. Many datasets are not actually distributed normally (power-law, Poisson, beta, etc, etc) and so the technique won't work. It's a much harder problem to 'generically' detect outliers without knowledge of the underlying distribution.<p>I don't have any idea how to do it (though a former colleague came up with a nice idea of building a histogram and searching for values that occurred after some number of empty bins, implying an outlier). Is there an accepted state-of-the-art for general-purpose outlier detection? Or is that such a broad question as to be meaningless?
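The empty-bins idea, as I understand it, might look something like this sketch - bin the data, start from the bin holding the median, walk outward, and flag everything beyond a run of empty bins. The bin count and gap size are made-up tuning knobs, not canonical values:

```python
import statistics


def gap_outliers(values, n_bins=20, min_gap=2):
    """Histogram-gap heuristic: values separated from the bulk of the
    data by >= min_gap consecutive empty bins are flagged as outliers."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return []
    width = (hi - lo) / n_bins

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1
    centre = bin_of(statistics.median(values))

    def boundary(step):
        """Index of the min_gap-th empty bin walking from the centre,
        or None if no big enough gap exists on that side."""
        gap, i = 0, centre
        while 0 <= i + step < n_bins:
            i += step
            if counts[i] == 0:
                gap += 1
                if gap >= min_gap:
                    return i
            else:
                gap = 0
        return None

    right, left = boundary(+1), boundary(-1)
    return [v for v in values
            if (right is not None and bin_of(v) > right)
            or (left is not None and bin_of(v) < left)]
```

Its appeal is that it makes no distributional assumptions at all; its weakness is that the answer depends entirely on how you pick the bin width.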
Standard deviations assume normal distributions... but ultimately when you look for events more than two standard deviations from the mean, you're just looking for events outside the range that covers roughly 95% of the data. A technique that works regardless of distribution is percentiles. PostgreSQL supports these now as well.
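For example, one common percentile-based rule is Tukey's fences: flag anything more than 1.5 interquartile ranges outside the quartiles. In PostgreSQL the quartiles would come from percentile_cont(0.25) WITHIN GROUP (ORDER BY x) and so on; here's the same logic as a Python sketch:

```python
import statistics


def percentile_fences(values, k=1.5):
    """Tukey's rule: flag values beyond k * IQR outside the quartiles.
    Never touches the mean or stddev, so no normality assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

The k=1.5 multiplier is just the conventional default; widen it for heavier-tailed data.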