Hadn't heard of the thing they were putting their data into. Marqo bills itself as "tensor search for humans": https://github.com/marqo-ai/marqo
A really interesting blog post on using LLMs for audio search, which I think is a pretty nifty/new idea.

I've found it cumbersome to build end-to-end systems with some of the newer vector DBs (Chroma, FAISS, etc.), but with Marqo it doesn't seem too hard. The whole flow is basically index-then-search, as in the sketch below.
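Roughly what I mean, using the marqo Python client (a minimal sketch, not the article's code: the index name and documents are made up, it assumes a local Marqo server on the default port, and note that recent client versions require `tensor_fields` while older ones embedded every field by default):

```python
import marqo

# Assumes a Marqo server running locally, e.g. via the official Docker image.
mq = marqo.Client(url="http://localhost:8882")

mq.create_index("podcast-transcripts")

# Index transcript chunks; Marqo embeds the tensor_fields for you.
mq.index("podcast-transcripts").add_documents(
    [
        {"_id": "ep1-0001", "speaker": "host", "text": "Welcome back to the show..."},
        {"_id": "ep1-0002", "speaker": "guest", "text": "Thanks for having me..."},
    ],
    tensor_fields=["text"],
)

# Semantic search over the indexed chunks.
results = mq.index("podcast-transcripts").search("where do they discuss pricing?")
for hit in results["hits"]:
    print(hit["_id"], hit["text"])
```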
This is interesting, but what problem does it solve better than Ctrl+F-ing a transcript? It seems like this would be a worse solution when the precise way someone says something matters (e.g. journalists parsing an interview, students studying their recorded lectures), and most useful when you're working with a large volume of recorded audio, such as customer service calls. That use case makes me somewhat uncomfortable, but perhaps I'm not fully understanding how it works.

Edit: wording
Both speaker and speech recognition are done in the article using Hugging Face models.

Is there anything as good that's ready to use on-prem for diarization (speaker recognition)?

I've heard good things about Whisper(.cpp) for speech recognition, and Vosk used to be king of that hill...
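Not sure about "as good", but pyannote.audio runs fully on-prem once the weights are downloaded. A rough sketch (the pipeline name and token placeholder are illustrative; the model is gated on the Hugging Face Hub, so check the current pyannote docs):

```python
from pyannote.audio import Pipeline

# Gated model: needs a (free) Hugging Face access token once to download,
# after which the weights are cached and inference runs locally.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="hf_...",  # placeholder: your own HF token
)

diarization = pipeline("meeting.wav")

# Print who spoke when.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```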
How does this compare to using Whisper, feeding the transcript into a vector DB, and querying with an LLM?

Pardon the dumb question, I only have an elementary understanding.
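For concreteness, the alternative pipeline you're describing would look roughly like this (a sketch assuming openai-whisper and sentence-transformers; the model names, file name, and in-memory search stand in for a real vector DB and are not what the article does):

```python
import whisper
from sentence_transformers import SentenceTransformer, util

# 1. Transcribe the audio locally with Whisper.
asr = whisper.load_model("base")
result = asr.transcribe("call.wav")
segments = [seg["text"].strip() for seg in result["segments"]]

# 2. Embed each transcript segment. A vector DB would store these;
#    here they just live in memory for the sketch.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
segment_vecs = embedder.encode(segments, convert_to_tensor=True)

# 3. Embed the question and retrieve the nearest segments
#    (these retrieved chunks are what you'd hand to an LLM).
query_vec = embedder.encode("when did they mention the refund?", convert_to_tensor=True)
hits = util.semantic_search(query_vec, segment_vecs, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}", segments[hit["corpus_id"]])
```

The main difference is who owns the plumbing: with this DIY pipeline you pick and wire up the ASR model, the embedder, and the store yourself, whereas Marqo bundles the embedding and retrieval steps behind one index API.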