Hello HN,<p>On my NAS I have a script that runs every night and downloads videos from 25 YouTube channels I find interesting. The original idea was to keep the content offline in case it gets delisted or the channel ceases to exist.
With the arrival of AI-assisted transcription, I added a cron job that has been running Whisper on all those files over the past months, and I now have around 10k transcripts for the same number of videos on disk.<p>The next step feels like building a knowledge base of some sort to get at all the knowledge hidden in those videos, which range from history content to GDC talks and livestreams by other devs.
Ideally I would like to search for topic X and get the matching parts of the video transcripts back. Since Whisper also saves the position, jumping to the relevant time in the video would be a bonus.
The search should ideally not only work on word matching but also find relevant content via some similarity measure.<p>I still have my trusty Information Retrieval Handbook from my university days and don't shy away from writing something on my own, but I was wondering whether there is something out there that already offers this functionality or at least takes a big part of the workload off me.
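To make the request concrete, here is a minimal sketch of the word-matching baseline: TF-IDF vectors with cosine similarity over transcript segments, each carrying its start time. It assumes every segment is a dict with `video`, `start` (seconds), and `text` keys; that shape, the file names, and the sample sentences are made up for illustration and may not match the actual Whisper output on disk. It is purely lexical, so the "similarity" part would need the TF-IDF vectors swapped for embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(segments):
    """Return (idf, vectors) with one L2-normalised TF-IDF vector per segment."""
    docs = [Counter(tokenize(s["text"])) for s in segments]
    df = Counter()
    for d in docs:
        df.update(d.keys())
    n = len(docs)
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for d in docs:
        v = {t: tf * idf[t] for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})
    return idf, vectors


def search(query, segments, idf, vectors, k=3):
    """Return up to k (score, video, start, text) tuples, best match first."""
    q = Counter(tokenize(query))
    qv = {t: tf * idf[t] for t, tf in q.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in qv.values())) or 1.0
    qv = {t: w / norm for t, w in qv.items()}
    scored = []
    for seg, v in zip(segments, vectors):
        score = sum(w * v.get(t, 0.0) for t, w in qv.items())
        if score > 0:
            scored.append((score, seg["video"], seg["start"], seg["text"]))
    scored.sort(key=lambda r: r[0], reverse=True)
    return scored[:k]


# Hypothetical sample data standing in for real transcript segments.
segments = [
    {"video": "gdc_talk.mp4", "start": 12.0,
     "text": "Today we talk about pathfinding in large open worlds."},
    {"video": "gdc_talk.mp4", "start": 95.5,
     "text": "The navmesh is rebuilt when the terrain changes."},
    {"video": "history.mp4", "start": 30.0,
     "text": "The Roman roads connected the provinces of the empire."},
]
idf, vectors = build_index(segments)
results = search("roman empire roads", segments, idf, vectors)
```

Each hit carries the video file and the start offset in seconds, which is what a player would need to jump to the relevant spot.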
Pretty cool.<p>You could embed the transcripts and then they'd be "searchable." <a href="https://platform.openai.com/docs/guides/embeddings/embedding-models" rel="nofollow">https://platform.openai.com/docs/guides/embeddings/embedding...</a>.<p>I've used Supabase with pgvector to do this. You can even store the position next to the embedding in the db so you could jump to the content in the UI. Ping me at my hn handle at gmail if you want more specifics :).
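A rough sketch of the "store the position next to the embedding" idea, under loud assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and an in-memory SQLite table stands in for Supabase/pgvector so the example runs without any services. The table and column names are invented for illustration. In a pgvector setup the embedding would live in a vector column and the ranking would presumably happen in SQL instead of in Python.

```python
import hashlib
import json
import math
import re
import sqlite3

DIM = 64  # toy dimensionality; real embedding models produce much larger vectors


def toy_embed(text):
    """Placeholder for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Both vectors are already normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (video TEXT, start_seconds REAL, text TEXT, embedding TEXT)"
)


def add_chunk(video, start_seconds, text):
    # The position is stored in the same row as the embedding.
    conn.execute(
        "INSERT INTO chunks VALUES (?, ?, ?, ?)",
        (video, start_seconds, text, json.dumps(toy_embed(text))),
    )


def nearest(query, k=3):
    """Return up to k (score, video, start_seconds, text) rows, most similar first."""
    q = toy_embed(query)
    rows = conn.execute(
        "SELECT video, start_seconds, text, embedding FROM chunks"
    ).fetchall()
    scored = [(cosine(q, json.loads(e)), v, s, t) for v, s, t, e in rows]
    scored.sort(key=lambda r: r[0], reverse=True)
    return scored[:k]


# Hypothetical sample rows.
add_chunk("gdc_talk.mp4", 95.5, "The navmesh is rebuilt when the terrain changes.")
add_chunk("history.mp4", 30.0, "The Roman roads connected the provinces of the empire.")
top = nearest("The navmesh is rebuilt when the terrain changes.", k=1)
```

Because the start offset comes back with every match, the UI only needs the video name and `start_seconds` to seek to the right place.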