TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Outlier Detection in SQL

75 pointsby vive-la-liberteover 9 years ago

7 comments

thomover 9 years ago
I&#x27;ve recently been doing a bunch of stuff with sports stats, which involves lots of GIS data. This sort of thing comes up a lot - trying to find a player&#x27;s &#x27;territory&#x27; based on coordinates of their actions in a game, without including outlying events that cause you to overestimate the area.<p>There&#x27;s a concept in animal behaviour called a &#x27;home range&#x27; which is more or less the same thing - GPS attached to tigers in the wild etc. Some of the algorithms there are quite interesting, from simply drawing a bounding box around the the data points, to working out the probability density, to things like LoCoH, which sort of recursively build up convex hulls from nearest neighbours.<p>All of these things are pretty much possible in SQL to one degree of performance or another. But ultimately I&#x27;m fascinated by things like SQL Server&#x27;s R support - you can get far simpler, more natural implementations of these things in R (or indeed in custom aggregates or functions in other languages). I think in the long term, database engines that offer this sort of extensibility are going to thrive for analytics work, be they SQL based or otherwise.
评论 #10917891 未加载
评论 #10918020 未加载
cwyersover 9 years ago
Is it just me or is Periscope allergic to saying what databases it supports? I can find nothing on the website, and I am really hesitant to sign up for more info once companies started actually calling my phone after I did that.<p>Also, none of those queries will work on MySQL.
评论 #10915895 未加载
评论 #10915392 未加载
评论 #10915567 未加载
评论 #10915480 未加载
评论 #10915449 未加载
评论 #10915419 未加载
评论 #10915774 未加载
et2oover 9 years ago
I would like to know about more sophisticated techniques for outlier detection. These are stat 101 level. Z scores? You can get into a lot of trouble assuming a normal distribution.<p>What are credit card companies doing? What&#x27;s the best way to combine multiple variables that are predictive of an event for outlier detection? Is there a simple framework to automate reporting of these events in real-time?<p>One way to model this might be to treat the outcome (&quot;Is this event an outlier?&quot;) as a 0&#x2F;1 variable and use one of the many ways to model that type of data–random forests, logistic regression, neural networks, etc. The problem is that this isn&#x27;t really &quot;outlier&quot; detection anymore.
评论 #10916666 未加载
numlockedover 9 years ago
These are some cool SQL tricks! I like it.<p>The big caveat with the standard deviation technique is that it assumes a normal distribution. Many datasets are not actually distributed normally (power-law, Poisson, beta, etc, etc) and so the technique won&#x27;t work. It&#x27;s a much harder problem to &#x27;generically&#x27; detect outliers without knowledge of the underlying distribution.<p>I don&#x27;t have any idea how to do it (though a former colleague came up with nice idea of building a histogram and searching for values that occurred after some number of empty bins, implying an outlier). Is there an accepted state-of-the-art for general-purpose outlier detection? Or is that such a broad question as to be meaningless?
评论 #10916254 未加载
评论 #10917911 未加载
评论 #10916708 未加载
darkxanthosover 9 years ago
Standard deviations assume normal distributions... but ultimately when you look for some two standard deviations from the mean, you&#x27;re just look for an event that only occurs 95% of the time. A technique that works regardless of distribution is percentiles. Postgesql supports these now as well.
评论 #10917904 未加载
tmandryover 9 years ago
Maybe I&#x27;m just unfamiliar with some of the Postgres syntax.. but shouldn&#x27;t there be a GROUP BY in the first query?
评论 #10916298 未加载
sqldbaover 9 years ago
More please.
评论 #10915938 未加载