I think their model should take a second pass on the words and probabilities, independent of the video.

Look at their example:

    Animal: 97.76%
    Tiger: 90.11%
    Terrestrial animal: 68.17%
So we are 90% sure it is a tiger but only 68% sure it is a land animal? That doesn't make sense: every tiger is a terrestrial animal, so the probability of "Terrestrial animal" should be at least as high as the probability of "Tiger".

It could be that this is a weakness of seeding AI training data with human labels. I can believe that 90% of people who saw the video would agree that it is a tiger, while fewer would agree it is a terrestrial animal, simply because they don't know what "terrestrial" means.
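As a rough sketch of what that second pass could look like: if the label taxonomy is known, one pass over the scores can raise every ancestor label to at least the score of its most probable descendant. The `parents` map below is a hypothetical stand-in; I have no idea what hierarchy (if any) their system actually uses.

    # Minimal sketch of a consistency pass over label scores, assuming a
    # known label hierarchy. The taxonomy here is hypothetical, not taken
    # from their system.
    parents = {
        "Tiger": "Terrestrial animal",
        "Terrestrial animal": "Animal",
    }

    def enforce_hierarchy(scores: dict[str, float]) -> dict[str, float]:
        """Raise each ancestor's score to at least its descendants' scores,
        since evidence for "Tiger" is also evidence for its ancestors."""
        fixed = dict(scores)
        for label in scores:
            ancestor = parents.get(label)
            while ancestor is not None:
                # An ancestor can never be less probable than a descendant.
                fixed[ancestor] = max(fixed.get(ancestor, 0.0), fixed[label])
                ancestor = parents.get(ancestor)
        return fixed

    scores = {"Animal": 0.9776, "Tiger": 0.9011, "Terrestrial animal": 0.6817}
    print(enforce_hierarchy(scores))
    # {'Animal': 0.9776, 'Tiger': 0.9011, 'Terrestrial animal': 0.9011}

Taking the max is just one choice; the point is only that some post-hoc step could keep the reported numbers consistent with the label hierarchy, whatever the raw classifiers output.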