It seems GPT-3 predicts the most likely next token (roughly, the next word) given the previous ones.<p>Given a large enough dataset of MP3 files, would it be possible to predict the next millisecond of audio from the previous milliseconds and generate songs that way? Could we generate videos by predicting the next best frame?<p>Is there any technical reason we couldn't collect first-person audio and video with the cameras and microphones on a Quest Pro and generate what the next few minutes of our lives could look like?
> predict the next millisecond of audio based on previous milliseconds of audio<p>Not milliseconds, but AudioLM [1] already does this from just a few seconds of audio, for speech (and piano). The results are already very convincing (to me).<p>[1] https://google-research.github.io/seanet/audiolm/examples/
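To make the "predict the next bit of audio from the previous bits" idea concrete, here is a toy sketch. It is not how AudioLM or GPT-3 work: it swaps the neural network for a simple count-based model over quantized samples, and the names `quantize`, `train`, and `generate`, the 16 levels, and the two-sample context are all assumptions made up for illustration. Only the generation loop (sample a next token, append it, repeat) is the shared idea.

```python
import math
import random
from collections import Counter, defaultdict


def quantize(x, levels=16):
    """Map a sample in [-1, 1] to an integer token in [0, levels - 1]."""
    return min(levels - 1, int((x + 1.0) / 2.0 * levels))


def train(tokens, context=2):
    """Count which token follows each window of `context` previous tokens."""
    counts = defaultdict(Counter)
    for i in range(context, len(tokens)):
        counts[tuple(tokens[i - context:i])][tokens[i]] += 1
    return counts


def generate(counts, prompt, steps, context=2, rng=None):
    """Autoregressive loop: sample the next token, append it, repeat."""
    rng = rng or random.Random(0)
    out = list(prompt)
    for _ in range(steps):
        options = counts.get(tuple(out[-context:]))
        if not options:
            break  # context never seen in training, so stop early
        candidates, weights = zip(*options.items())
        out.append(rng.choices(candidates, weights=weights)[0])
    return out


# Toy "dataset": a sine wave standing in for real audio samples.
wave = [math.sin(2 * math.pi * t / 20) for t in range(400)]
tokens = [quantize(x) for x in wave]

model = train(tokens)
continuation = generate(model, tokens[:2], steps=50)
```

The continuation starts from a two-token prompt and extends it one token at a time, with each new token conditioned only on what came before it. Real systems replace the count table with a large learned model and the raw samples with a much more compact token representation.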
"Signatures" on videos to prove that, yes, they are "authentic" and not AI-generated. I have no idea how it'd be enforced though.