I'm wondering if it would make sense to use an existing H.264/H.265/H.266/AV1 encoder as the tokenizer, and then learn a set of embeddings corresponding to the symbols in the resulting bitstream. The tokenization they're doing is morally equivalent to what video codecs already do: carve frames into blocks, transform them, and entropy-code the quantized coefficients and motion information.
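To make the analogy concrete, here's a minimal sketch of codec-style tokenization: block DCT plus coarse quantization, with the quantized coefficients serving as discrete tokens. The block size, quantization step, and vocabulary size are illustrative choices, not taken from any codec spec, and a real encoder would of course also use prediction, motion compensation, and entropy coding.

```python
import numpy as np
from scipy.fft import dctn

def tokenize_frame(frame, block=8, q=16, vocab=256):
    """Codec-style tokenization sketch: 8x8 block DCT + coarse quantization.
    `q` (quantization step) and `vocab` are illustrative, not from any spec."""
    h, w = frame.shape
    h, w = h - h % block, w - w % block  # drop partial edge blocks
    tokens = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            coeffs = dctn(frame[y:y+block, x:x+block].astype(float), norm="ortho")
            # Keep only the quantized DC coefficient as a stand-in for the
            # codec's entropy-coded symbols; a real scheme would keep more.
            tokens.append(int(np.clip(coeffs[0, 0] // q, 0, vocab - 1)))
    return tokens

# A flat gray frame yields one identical token per 8x8 block.
frame = np.full((64, 64), 100, dtype=np.uint8)
print(len(tokenize_frame(frame)))  # one token per block
```

The appeal of this framing is that the codec has already done the hard perceptual work: it spends bits where the signal is, so the token stream is naturally concentrated on the parts of the video that matter.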
Would event camera input data be useful here?

https://en.wikipedia.org/wiki/Event_camera

“Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.”
Interestingly, biological vision in reptiles (and probably other species) works largely on the same principle: it tends to filter out static background and respond mainly to change and motion.