Wow, this field is moving fast. Just yesterday I was commenting on R-CNN, which is apparently old news now! Although it appears the same basic architecture is still used. If you're looking for a nice history of object detection leading up to a survey of current methods and a discussion of R-CNN, check out this talk by Larry Zitnick, one of the key people behind the OP: <a href="https://youtu.be/UXHWNNzdPVM" rel="nofollow">https://youtu.be/UXHWNNzdPVM</a>.<p>At the end of this post they talk about extending this to video, which in my opinion is a much harder problem. Convnets are standard for images, but no one has found a really good architecture for video. Some key questions I have about video:
1) How do humans perceive moving things? My guess is that there are major differences, reaching all the way down to the visual cortex, that would warrant brand-new architectures.
2) Could we operate neural nets directly on encoded data, such as H.264? Not only would this be computationally much more efficient than decoding video into frames and processing each one, but some codecs give you motion vectors and other useful temporal data for free.
3) How do we handle temporal information? LSTMs work well for sequential data, and there's been some work on applying them to video, but I'm not aware of much success using sequential networks to detect higher-level structure like plot points in movies.
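On question 2: the motion vectors a codec hands you are essentially the output of block-matching motion estimation. Here's a toy numpy sketch of that idea, just to make it concrete — the frames, block size, and search window are made up for illustration, and real encoders are far more sophisticated (sub-pixel search, variable block sizes, rate-distortion tradeoffs):

```python
import numpy as np

def block_motion_vectors(prev, curr, block=8, search=4):
    """Crude exhaustive block matching: for each block of `curr`, find
    the best-matching block in `prev` within a small search window and
    return its (dy, dx) displacement back into the previous frame."""
    h, w = curr.shape
    vecs = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            target = curr[y:y + block, x:x + block]
            best, best_err = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = y + dy, x + dx
                    if py < 0 or px < 0 or py + block > h or px + block > w:
                        continue  # candidate block falls outside the frame
                    err = np.abs(prev[py:py + block, px:px + block] - target).sum()
                    if err < best_err:
                        best_err, best = err, (dy, dx)
            vecs[by, bx] = best
    return vecs

# Toy example: a bright square moves 2 pixels to the right between frames.
prev = np.zeros((16, 16)); prev[4:8, 4:8] = 1.0
curr = np.zeros((16, 16)); curr[4:8, 6:10] = 1.0
mv = block_motion_vectors(prev, curr)
```

A codec computes something like `mv` for every inter-coded frame anyway, so a network consuming it directly would skip both decoding and re-deriving the motion with optical flow.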
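On question 3: the usual recipe is to run a convnet on each frame and feed the resulting feature vectors to an LSTM that makes a video-level prediction. A minimal numpy sketch of that pipeline — the "CNN features" are random stand-ins and the weights are untrained, so this only shows the shape of the computation, not a working detector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_classify(features, params):
    """Run a single-layer LSTM over per-frame feature vectors and
    return class scores computed from the final hidden state."""
    W, U, b, W_out, b_out = params
    hidden = W.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in features:                       # one feature vector per frame
        z = W @ x + U @ h + b                # all four gates in one matmul
        i = sigmoid(z[:hidden])              # input gate
        f = sigmoid(z[hidden:2 * hidden])    # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden])  # candidate cell state
        o = sigmoid(z[3 * hidden:])          # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return W_out @ h + b_out                 # video-level class scores

# Toy run: 10 frames of 32-d "CNN features", 16 hidden units, 5 classes.
rng = np.random.default_rng(0)
feat_dim, hidden, n_classes, n_frames = 32, 16, 5, 10
params = (rng.normal(0, 0.1, (4 * hidden, feat_dim)),
          rng.normal(0, 0.1, (4 * hidden, hidden)),
          np.zeros(4 * hidden),
          rng.normal(0, 0.1, (n_classes, hidden)),
          np.zeros(n_classes))
frames = rng.normal(size=(n_frames, feat_dim))
scores = lstm_classify(frames, params)
```

This works for short clips and simple labels; the open problem I'm gesturing at is that a single hidden state carried across an entire movie seems like a thin bottleneck for something like plot structure.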