This is very cool. Judging from the videos, the next step for them is to add some sort of temporal stability, so that detected objects don't get temporarily forgotten across frames and the bounds expand and contract smoothly. It's obvious that the detection is being run one frame at a time.<p>I also wonder to what extent merging the detection with the motion-vector information in the underlying codec's P-frames would help. Knowing that a segment of video just moved to the left would mean the detected object could be moved to the left by the same amount, even if it was passing behind another object. Calculating the movement vectors independently seems silly if you can get that data from the video codec itself.
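To make the temporal-stability idea concrete, here is a rough sketch of one way it might look. This is not anything from the paper: the `SmoothedTrack` class, the `alpha` and `max_missed` parameters, and the `motion` argument (a per-object displacement that I am assuming has already been extracted from the codec's motion vectors) are all hypothetical names for illustration. It blends each new detection with the previous box using an exponential moving average, and if the detector misses the object for a few frames it coasts the old box along by the supplied motion.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class SmoothedTrack:
    """Hypothetical per-object smoother; not part of YOLO itself."""
    alpha: float = 0.5       # weight given to the newest detection
    max_missed: int = 3      # frames to coast without a detection
    box: Optional[Box] = None
    missed: int = 0

    def update(self, detection: Optional[Box],
               motion: Tuple[float, float] = (0.0, 0.0)) -> Optional[Box]:
        # Shift the previous box by the (assumed) codec motion vector.
        predicted: Optional[Box] = None
        if self.box is not None:
            x, y, w, h = self.box
            predicted = (x + motion[0], y + motion[1], w, h)

        if detection is None:
            # Detector missed the object: coast on the shifted box for a while.
            if predicted is None:
                return None
            self.missed += 1
            if self.missed > self.max_missed:
                self.box = None
                return None
            self.box = predicted
            return self.box

        self.missed = 0
        if predicted is None:
            # First sighting: nothing to smooth against yet.
            self.box = detection
        else:
            # Exponential moving average between detection and prediction.
            self.box = tuple(
                self.alpha * d + (1.0 - self.alpha) * p
                for d, p in zip(detection, predicted)
            )
        return self.box
```

A real tracker would also need to associate detections with tracks across frames (for example by overlap), which this sketch leaves out entirely.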
They named their method "YOLO"…<p>Edit: to add something more "helpful" to this comment, their paper links to a YouTube channel [1] that shows demos of their method, which I think is great.<p>[1] <a href="https://goo.gl/bEs6Cj" rel="nofollow">https://goo.gl/bEs6Cj</a>
This is really cool, even inspiring.
Not just because it's one of the first examples I've seen of accurate, real-time detection powered by neural nets, but because they're getting these results via black magic, basically.<p>The objective function is defined heuristically, and involves about five different sub-objectives (top of page four).
Some of the parameters chosen seem to be rough guesses, as does the decision to scale up the images to twice the resolution when moving from classification (the pre-training task) to detection.<p>It seems miraculous that a process of estimation and refinement, guided by experience, can work on tasks where you have no mathematical guarantee that a good solution can be found.
Maybe in time we'll build the theory that explains just why deep learning works so well, but for now I'm just kinda awed and impressed every time one of these stories comes out.
In the paper they use the abbreviation mAP without explaining what it is or providing a reference, such as "Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors"; do folks know what mAP is?
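For what it's worth, mAP in object detection normally stands for mean average precision: you compute an average precision (roughly, the area under the precision-recall curve) for each object class and then take the mean over classes. Below is a minimal sketch of that calculation. The function names are mine, and it assumes each detection has already been marked as a true or false positive (typically by matching it to a ground-truth box at some overlap threshold). The exact interpolation used by a given benchmark may differ from the all-points version here, so don't expect it to reproduce the paper's numbers.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence, is_true_positive)


def average_precision(detections: List[Detection],
                      num_ground_truth: int) -> float:
    """Area under the interpolated precision-recall curve for one class."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Interpolate: precision at a recall level is the best precision
    # achieved at that recall or any higher one.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(
        per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Mean of the per-class average precisions."""
    aps = [average_precision(dets, n_gt) for dets, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

So "double the mAP" means the detector's precision-recall trade-off, averaged across all classes, is about twice as good by this measure.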
One of the authors has additional information posted here: <a href="http://pjreddie.com/darknet/yolo/" rel="nofollow">http://pjreddie.com/darknet/yolo/</a>