Anomaly detection in graphs is a critical problem for finding suspicious behavior in innumerable systems, such as intrusion detection, fake ratings, and financial fraud. But most of the systems in place focus either on static graphs or on entire graph snapshots if they consider dynamic (time-evolving) graphs.<p>However, to minimize the effect of malicious activities and start recovery as soon as possible, we need to detect anomalies in real-time or near real-time i.e. to identify whether an incoming edge is anomalous or not, as soon as we receive it. In addition, since the number of vertices can increase as we process the stream of edges, we need an algorithm which uses constant memory in graph size. Moreover, fraudulent or anomalous events in many applications occur in microclusters or suddenly arriving groups of suspiciously similar edges e.g. denial of service attacks in network traffic data and lockstep behavior.<p>In this work, we propose MIDAS, which detects microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, in edge streams, using constant time and memory. In addition, by using a principled hypothesis testing framework, MIDAS provides theoretical bounds on the false positive probability, which earlier methods do not provide. Also, we are up to 644 times faster and 48% more accurate than previous state of the art approaches.<p>For more details, read the paper at: <a href="https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf" rel="nofollow">https://www.comp.nus.edu.sg/~sbhatia/assets/pdf/midas.pdf</a><p>If you are in New York, you are welcome to attend the talk at AAAI in New York Hilton Midtown on 11th February at 2pm.
Before I read your paper, I didn't realize the power of hypothesis tests. I was surprised to see theoretical bounds on anomaly detection with such weak constraint through simple Chi-Squared goodness-of-fit. Delicate idea!
Impressive results! Few questions though:-<p>1) Do you use sliding window approach or exponential decay?<p>2) When you say groups of similar edges, do they have to be spatially close to each other? or can they be equally distributed in the graph?<p>3) Can an attacker figure out the optimum time difference between his attacks such that MIDAS doesn't detect it as an anomaly? The time gap is just sufficient enough for the algorithm to weed out the potentially malicious micro-cluster as obsolete.<p>Seems like an interesting extension to your work. Best of luck!
Great work! However, from what I understand, microcluster detection seems to be very similar to subgraph detection. Is there a difference between the two problems? And if yes, then how do their complexities compare?
I found the paper to be quite exciting, and it is also solving one of the main tasks in anomaly detection, that is real-time anomaly detection. With other machine/deep learning based algorithms, one can model the data distribution quite efficiently, but there is a tradeoff of flagging time. With the statistical method suggested in the paper, you not only get state-of-the-art results but are also able to report the results in real-time, a big plus.