People complain that the magic of programming has been lost because all we do is stitch together APIs. They're oversimplifying the work involved. Those people are cynical; this is amazing.
>Matching faces to voices relies on simple co-occurence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)<p>This seems like the really hard part. Maybe there is a way to detect when the lips are moving for each face, or to guess the gender and age of both face and voice. Or, if the audio is a stereo mix, to use the relative position of each voice.
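To make the failure case concrete, here is a minimal sketch of what a co-occurrence heuristic might look like. This is a guess at the idea, not the project's actual code: the function names and the `(id, start, end)` tuple format are made up for illustration, and it assumes some earlier step has already produced speaker segments and face tracks with timestamps in seconds.

```python
from collections import defaultdict


def overlap(a, b):
    """Length in seconds of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def match_speakers_to_faces(speech_segments, face_tracks):
    """Assign each speaker the face that is on screen most while they talk.

    speech_segments: list of (speaker_id, start, end)  -- hypothetical format
    face_tracks:     list of (face_id, start, end)     -- hypothetical format
    Returns {speaker_id: face_id}, or None for a speaker with no visible face.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for speaker, s_start, s_end in speech_segments:
        scores[speaker]  # make sure every speaker gets an entry
        for face, f_start, f_end in face_tracks:
            shared = overlap((s_start, s_end), (f_start, f_end))
            if shared > 0:
                scores[speaker][face] += shared
    return {
        speaker: (max(faces, key=faces.get) if faces else None)
        for speaker, faces in scores.items()
    }


speech = [("A", 0, 4), ("B", 4, 8), ("A", 8, 10)]

# The camera cuts to whoever is talking: co-occurrence separates the two.
cuts = [("face1", 0, 4), ("face2", 4, 8), ("face1", 8, 10)]
print(match_speakers_to_faces(speech, cuts))

# Single angle, both faces visible the whole time: every score ties,
# so both speakers end up attached to the same face.
single_angle = [("face1", 0, 10), ("face2", 0, 10)]
print(match_speakers_to_faces(speech, single_angle))
```

In the single-angle case the overlap scores carry no information at all, which is why an extra signal such as lip movement or stereo position would be needed to break the tie.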
Very cool. Zack Freedman made something similar to wear on a hoodie. It wasn't perfect, but it worked. <a href="https://youtu.be/mTK8dIBJIqg" rel="nofollow">https://youtu.be/mTK8dIBJIqg</a>
So I've taken to saving many, many posts from HN with <a href="https://web.archive.org/save/" rel="nofollow">https://web.archive.org/save/</a> and this is the amusing result I got when trying to save this one: "This host has been already captured 100,091.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more." Haha.
Pretty darn cool!<p>I almost want to try implementing this in OpenCV.js to see whether it's possible to build a browser extension that does this in the browser itself.
How easy is it to translate word by word like that? I was under the impression that generally it is hard because different languages have different word orders. Is it not necessary to have the whole sentence before starting? (Or maybe Polish is just conveniently similar to English in word order?)
This is utterly incredible. I have so many ideas after reading way too much sci-fi and watching Ridley Scott’s sci-fi series, Raised By Wolves. (What if we could create a benevolent, kind and caring AI to help humans grow and navigate this world, like Father in the series?)<p>I want to jump in, badly. How would one go about picking up the skills needed to create stuff like this, as pragmatically, concretely and efficiently as possible, without getting sidetracked by overly theoretical distractions?<p>I’m a full-stack engineer with an MS in CS and pretty good math chops, but sadly I took only one machine learning course in all of my formal education.<p>How do I get into this (GPT-3 and ChatGPT are also on my mind)? Please, any books, MOOCs, etc.
Is it possible to run this tool locally and use it for myself? I don't see any instructions for that in the README; all I see are ways to set it up for development and ways to deploy it to third-party cloud services.
Videos encode so much information, and it looks pretty cool when projects like this extract higher-level information to play with. Recent models like Whisper and CLIP are amazing for making sense of that information, even in personal projects.<p>We are also trying to do something similar[0], but there is still a lot of work remaining. The idea is to allow real-time processing of any video using just a CPU.
[0] <a href="https://www.youtube.com/watch?v=E7UPj9blnWc">https://www.youtube.com/watch?v=E7UPj9blnWc</a>
I developed an online course on serverless machine learning, where you can learn some of the principles of refactoring ML systems into separate feature/training/inference pipelines:
<a href="https://github.com/featurestoreorg/serverless-ml-course">https://github.com/featurestoreorg/serverless-ml-course</a><p>Some of the students have built similar systems, for example chaining Whisper with ChatGPT, translation, or sentiment analysis of the transcribed text. Here is one that transcribes Swedish speech and reports the sentiment of the text:
<a href="https://huggingface.co/spaces/Chrysoula/voice_to_text_swedish" rel="nofollow">https://huggingface.co/spaces/Chrysoula/voice_to_text_swedis...</a>
I would like to point out: as a newbie coder, I had a ton of fun learning how your code works with the help of ChatGPT. I even learned about unfamiliar topics like serverless apps, NLP transformers, and YAML vs JSON lol
Does anyone have a link to something similar for real-time captions (audio only, e.g. Google Meet, Twitter Spaces)? Or is that typically done on the device itself?