I'm struck by how very dissimilar each successive "match" in the video is. If this is a neural network that makes matches based on video, how on earth is it picking matches from completely different training videos for every subsequent impact? That might happen if each training video contained only a single impact sample, but that's not the case here: the paper says each training video has 48 actions.

I can totally understand if the authors wanted to make sure it doesn't pick the exact same sample multiple times in a row and penalized duplicates, but I don't see any mention of that in the paper. And even if they did, I'd expect to see subsequent matches from the same training video rather than picks from completely different videos.
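Roughly the kind of duplicate penalty I mean, as a sketch (the feature vectors, L2 distance, and penalty weight are all my guesses, not anything from the paper):

    # Sketch of a duplicate-penalized nearest-neighbor pick. Everything
    # here (features, distance, penalty weight) is assumed, not from the
    # paper.
    import numpy as np

    def pick_match(query, bank_feats, bank_video_ids, recent_ids, penalty=0.5):
        # Distance from the query impact to every sound snippet in the bank.
        dists = np.linalg.norm(bank_feats - query, axis=1)
        # Make snippets from recently used source videos less attractive.
        for i, vid in enumerate(bank_video_ids):
            if vid in recent_ids:
                dists[i] += penalty
        return int(np.argmin(dists))

Even with a penalty like that, the second-best match would usually still come from the same video, which is why the jumps between completely different videos look odd to me.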
The neural network could learn to make a "boom" sound for explosions, e.g. in Michael Bay films.

It also reminded me of this GIF:
How to get a song stuck in someone's head within 3 frames:
http://imgur.com/gallery/c18TRq0
I would love to experience an action movie that uses this to generate the sound effects automatically.

It would also be a nice tool for professional Foley artists, automating part of the process.
I can hear the sound of people's voices when I am watching a video with the sound off, even if I have never heard them speak before. I had glue ear that went undiagnosed for a long time as a kid, so that may have something to do with it.
Anyone needing an idea for a hack project: make a real-time version of this for phone or webcam. Bonus points for swapping in ridiculous sound sets, or keying augmented-reality visual-effects overlays.
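A bare-bones starting point for the webcam version, assuming OpenCV for capture and simpleaudio for playback (the threshold and "thud.wav" are placeholders): play a canned sound whenever inter-frame motion spikes, as a crude stand-in for the learned model.

    # Crude real-time stand-in: trigger a canned sound whenever webcam
    # motion spikes. OpenCV and simpleaudio are assumed; "thud.wav" is a
    # placeholder from your ridiculous sound set.
    import cv2
    import numpy as np
    import simpleaudio as sa

    wave = sa.WaveObject.from_wave_file("thud.wav")
    cap = cv2.VideoCapture(0)      # default webcam
    prev = None
    THRESHOLD = 12.0               # mean abs frame difference; tune by eye

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None and np.abs(gray - prev).mean() > THRESHOLD:
            wave.play()            # fire-and-forget; overlapping hits are fine
        prev = gray
        cv2.imshow("auto-foley", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()

Swapping the wave file per detected "impact type" would get you the ridiculous-sound-set version with very little extra work.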