I've recently been doing a bunch of stuff with sports stats, which involves lots of GIS data. This sort of thing comes up a lot - trying to find a player's 'territory' based on coordinates of their actions in a game, without including outlying events that cause you to overestimate the area.<p>There's a concept in animal behaviour called a 'home range' which is more or less the same thing - GPS attached to tigers in the wild etc. Some of the algorithms there are quite interesting, from simply drawing a bounding box around the data points, to working out the probability density, to things like LoCoH, which sort of recursively builds up convex hulls from nearest neighbours.<p>All of these things are pretty much possible in SQL to one degree of performance or another. But ultimately I'm fascinated by things like SQL Server's R support - you can get far simpler, more natural implementations of these things in R (or indeed in custom aggregates or functions in other languages). I think in the long term, database engines that offer this sort of extensibility are going to thrive for analytics work, be they SQL based or otherwise.
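To make the contrast concrete, here's a rough pure-Python sketch (no GIS libraries, made-up point data) of the two simplest 'home range' estimators mentioned above - a bounding box versus a convex hull. This is just an illustration of the idea, not anyone's production code:

```python
def bounding_box_area(points):
    """Area of the axis-aligned bounding box around (x, y) points --
    the crudest home-range estimate, very sensitive to outliers."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def convex_hull(points):
    """Andrew's monotone chain algorithm: returns the hull vertices
    in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    s = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0
```

A single far-away event inflates the bounding box much faster than the hull, which is exactly why the fancier methods (density estimates, LoCoH) go further and drop outlying points entirely.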
Is it just me, or is Periscope allergic to saying which databases it supports? I can find nothing on the website, and I've been really hesitant to sign up for more info ever since companies started actually calling my phone after I did that.<p>Also, none of those queries will work on MySQL.
I would like to know about more sophisticated techniques for outlier detection. These are stat 101 level. Z scores? You can get into a lot of trouble assuming a normal distribution.<p>What are credit card companies doing? What's the best way to combine multiple variables that are predictive of an event for outlier detection? Is there a simple framework to automate reporting of these events in real-time?<p>One way to model this might be to treat the outcome ("Is this event an outlier?") as a 0/1 variable and use one of the many ways to model that type of data–random forests, logistic regression, neural networks, etc. The problem is that this isn't really "outlier" detection anymore.
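One fairly standard step up from plain z-scores - a sketch of the general idea, not a complete answer to the question - is to replace the mean and standard deviation with the median and the median absolute deviation (MAD), which are far less distorted by the very outliers you're trying to find. The 3.5 cutoff below is the commonly cited Iglewicz-Hoaglin threshold for modified z-scores:

```python
import statistics


def robust_z_scores(values, consistency=1.4826):
    """Modified z-scores using median and MAD instead of mean/stddev.
    The 1.4826 factor makes MAD consistent with the standard deviation
    when the data really is normal."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; data too concentrated for this method")
    return [(v - med) / (consistency * mad) for v in values]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    return [v for v, z in zip(values, robust_z_scores(values))
            if abs(z) > threshold]
```

It still assumes roughly symmetric, unimodal data, so it doesn't answer the multivariate or real-time parts of the question - but it survives contaminated samples that wreck an ordinary z-score.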
These are some cool SQL tricks! I like it.<p>The big caveat with the standard deviation technique is that it assumes a normal distribution. Many datasets are not actually distributed normally (power-law, Poisson, beta, etc, etc) and so the technique won't work. It's a much harder problem to 'generically' detect outliers without knowledge of the underlying distribution.<p>I don't have any idea how to do it (though a former colleague came up with a nice idea of building a histogram and searching for values that occurred after some number of empty bins, implying an outlier). Is there an accepted state-of-the-art for general-purpose outlier detection? Or is that such a broad question as to be meaningless?
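The empty-bins idea, as I understand it, might look something like this sketch - bin the data, start from the bin holding the median, walk outward, and flag everything beyond a run of empty bins. The bin count and gap size are made-up tuning knobs, not canonical values:

```python
import statistics


def gap_outliers(values, n_bins=20, min_gap=2):
    """Histogram-gap heuristic: values separated from the bulk of the
    data by >= min_gap consecutive empty bins are flagged as outliers."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return []
    width = (hi - lo) / n_bins

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1
    centre = bin_of(statistics.median(values))

    def boundary(step):
        """Index of the min_gap-th empty bin walking from the centre,
        or None if no big enough gap exists on that side."""
        gap, i = 0, centre
        while 0 <= i + step < n_bins:
            i += step
            if counts[i] == 0:
                gap += 1
                if gap >= min_gap:
                    return i
            else:
                gap = 0
        return None

    right, left = boundary(+1), boundary(-1)
    return [v for v in values
            if (right is not None and bin_of(v) > right)
            or (left is not None and bin_of(v) < left)]
```

Its appeal is that it makes no distributional assumptions at all; its weakness is that the answer depends entirely on how you pick the bin width.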
Standard deviations assume normal distributions... but ultimately when you look for events more than two standard deviations from the mean, you're just looking for events outside the range that covers roughly 95% of the data. A technique that works regardless of distribution is percentiles. PostgreSQL supports these now as well.
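For example, one common percentile-based rule is Tukey's fences: flag anything more than 1.5 interquartile ranges outside the quartiles. In PostgreSQL the quartiles would come from percentile_cont(0.25) WITHIN GROUP (ORDER BY x) and so on; here's the same logic as a Python sketch:

```python
import statistics


def percentile_fences(values, k=1.5):
    """Tukey's rule: flag values beyond k * IQR outside the quartiles.
    Never touches the mean or stddev, so no normality assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

The k=1.5 multiplier is just the conventional default; widen it for heavier-tailed data.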