I'm wondering if it would make sense to use an existing H.264/H.265/H.266/AV1 encoder as the tokenizer, and then learn a set of embeddings corresponding to the symbols in the resulting bitstream. The tokenization they're doing is morally equivalent to what video codecs already do: carve frames into blocks, transform them, and entropy-code the quantized coefficients and motion information.
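To make the analogy concrete, here's a minimal sketch of codec-style tokenization: block DCT plus coarse quantization, with the quantized coefficients serving as discrete tokens. The block size, quantization step, and vocabulary size are illustrative choices, not taken from any codec spec, and a real encoder would of course also use prediction, motion compensation, and entropy coding.

```python
import numpy as np
from scipy.fft import dctn

def tokenize_frame(frame, block=8, q=16, vocab=256):
    """Codec-style tokenization sketch: 8x8 block DCT + coarse quantization.
    `q` (quantization step) and `vocab` are illustrative, not from any spec."""
    h, w = frame.shape
    h, w = h - h % block, w - w % block  # drop partial edge blocks
    tokens = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            coeffs = dctn(frame[y:y+block, x:x+block].astype(float), norm="ortho")
            # Keep only the quantized DC coefficient as a stand-in for the
            # codec's entropy-coded symbols; a real scheme would keep more.
            tokens.append(int(np.clip(coeffs[0, 0] // q, 0, vocab - 1)))
    return tokens

# A flat gray frame yields one identical token per 8x8 block.
frame = np.full((64, 64), 100, dtype=np.uint8)
print(len(tokenize_frame(frame)))  # one token per block
```

The appeal of this framing is that the codec has already done the hard perceptual work: it spends bits where the signal is, so the token stream is naturally concentrated on the parts of the video that matter.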
Would event camera input data be useful here?

https://en.wikipedia.org/wiki/Event_camera

“Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.”
Interestingly, biological vision in reptiles (and probably other species) works largely on the same principle: it tends to filter out static background and respond mainly to change and motion.