I made a simple Chrome extension that similarly pulls down the video transcript and sends this to the openai chat completions endpoint:
<a href="https://github.com/josephrmartinez/AskYouTube">https://github.com/josephrmartinez/AskYouTube</a><p>This extension allows me to "ask" the model to perform a task on the video content:
- "Give me the materials list" (for a diy video)
- "What was the recommended book?" (for a 2+ hour podcast where they made a reference I can't find again easily)
- "Extract the recommended protocol" (for 3+ hour health videos)
- "Provide a counterargument" (for when I'm getting bored...)<p>Big plus is that you DO NOT need to wait for the ad to play through. I can just navigate to the video and send in a query without having to watch any ads.<p>YouTube transcripts are pretty rough. At first, I used Whisper to create a better transcript. But my primary use is to ask something of the YouTube video - I found that slinging the so-so transcript along with my task was totally fine. Really simple project: Chrome extension in just HTML, CSS, and JS. FastAPI server for the OpenAI endpoint. The server function does a quick tokenization on the transcript to determine whether I need the GPT-4 model for its 128k context window or whether the GPT-3.5 16k context window is enough.<p>Naturally, here is a short YouTube demo of the extension: <a href="https://www.youtube.com/watch?v=M1zq9NKIcbw&t=54s" rel="nofollow noreferrer">https://www.youtube.com/watch?v=M1zq9NKIcbw&t=54s</a>
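The model-selection step can be sketched roughly like this (a minimal sketch, not the extension's actual code: the ~4 chars/token estimate stands in for a real tokenizer, and the model names and the 14k-token threshold are illustrative assumptions):

```python
def pick_model(transcript: str) -> str:
    """Pick a chat model based on a rough token estimate of the transcript.

    Uses the common ~4 characters per token heuristic; a real server
    would run an actual tokenizer (e.g. tiktoken) instead.
    """
    est_tokens = len(transcript) // 4
    # Leave headroom below the 16k limit for the prompt and the reply.
    if est_tokens > 14_000:
        return "gpt-4-1106-preview"  # 128k context window
    return "gpt-3.5-turbo-16k"       # 16k context window


print(pick_model("a short transcript"))
print(pick_model("word " * 30_000))
```

The threshold is deliberately conservative so the question and the model's answer still fit alongside the transcript.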
Since I had the same question as everyone else, it seems like it must be using just the transcript. When asking about one of those "8k HDR" showcase videos (with no speech), Bard responds with:<p>> I'm sorry, but I'm unable to access this YouTube content. This is possible for a number of reasons, but the most common are: the content isn't a valid YouTube link, potentially unsafe content, or the content does not have a captions file that I can read.
Whisper (OpenAI speech-to-text) is already trained on YT content; amusingly, if you mumble incoherently, its most-probable completion for noise is “thanks for watching!”
If it gets very good at "understanding" YouTube and other video content, Google could maybe find some kind of training data advantage not available to a pure text based model.
Here we go. The AI revolution begins with learning from how-to videos. Just create the latent space for video/visual understanding; it's going to be very interesting to explore.
I wonder how this works. It sounds like it's transcript driven, but then the next question is - were the transcripts automatically generated or user-submitted?<p>If the former, is this not going to run into the same issue as training AI on datasets created by AI? I run into so many mistranscribed words when using automatic transcripts that I can't imagine the data quality is excellent without supplementing the transcripts with inference on the video itself.
Is there any reason to believe YouTube content will only be trained on by Bard?<p>Stuff like YouTubeDL exists and works fine. I would assume that others could scrape and train on it, too? Or does that sound outlandishly expensive?
Open source version of something similar:<p><a href="https://github.com/PKU-YuanGroup/Video-LLaVA">https://github.com/PKU-YuanGroup/Video-LLaVA</a>
Understanding the videos is all very well but can it understand:<p>1- the popularity of "tier list" videos?<p>2- why those douchetuber "prank" videos exist?<p>3- Logan and/or Jake Paul?