I wonder why the example videos are in this specific clip-compilation format.<p>It feels to me that to navigate that, you essentially have to index 500 ten-second videos, and that looks a lot easier than retrieving information from an actual one-hour video, because the latter will have a lot more easy-to-mix-up moments. So maybe it hides an inability to answer questions about actual long videos (in the paper, the other example videos cap out at 3 minutes in length, from what I can see).<p>On the other hand, maybe it's just for presenting results, because it is much more readily "verifiable" for everyone than saying "trust us, the correct answer is unarguably somewhere in this very long video".<p>So if someone happens to know more about that, I'd be very interested.