Serverless Video Transcription inspired by Cyberpunk 2077

774 points by pierremenard over 2 years ago

25 comments

piyh over 2 years ago

People complain that the magic of programming has been lost because all we do is stitch together APIs. They're oversimplifying the work done. Those people are cynical; this is amazing.

totetsu over 2 years ago

> Matching faces to voices relies on simple co-occurrence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)

This seems like the really hard part. Maybe there is a way to find the times when the lips move for a face, or to guess the gender and age of both the face and the voice. Or, if the audio is a stereo mix, to use relative position.
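
For readers wondering what a co-occurrence heuristic of this kind might look like, here is a minimal sketch. It is not the project's actual code. It assumes speaker diarization yields `(speaker_id, start, end)` segments and face tracking yields `(face_id, start, end)` on-screen intervals; the function names and sample data are hypothetical. Each speaker is assigned the face that shares the most screen time with their speech.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two time intervals (0.0 if they do not intersect)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def match_speakers_to_faces(speech_segments, face_appearances):
    """Assign each speaker the face that is on screen longest while they talk.

    speech_segments:  list of (speaker_id, start, end), assumed diarization output
    face_appearances: list of (face_id, start, end), assumed face-tracking output
    """
    scores = defaultdict(lambda: defaultdict(float))
    for speaker, s_start, s_end in speech_segments:
        for face, f_start, f_end in face_appearances:
            shared = overlap(s_start, s_end, f_start, f_end)
            if shared > 0:
                scores[speaker][face] += shared
    # Pick the highest-scoring face per speaker; ties go to the first face seen.
    return {speaker: max(faces, key=faces.get) for speaker, faces in scores.items()}


# Hypothetical data: the camera cuts to whoever is talking.
speech = [("spk_A", 0.0, 4.0), ("spk_B", 4.0, 9.0), ("spk_A", 9.0, 12.0)]
faces = [("face_1", 0.0, 4.5), ("face_2", 3.5, 9.0), ("face_1", 9.0, 12.0)]
matches = match_speakers_to_faces(speech, faces)
print(matches)

# Hypothetical single-angle recording: both faces are visible the whole time.
single_angle = [("face_1", 0.0, 12.0), ("face_2", 0.0, 12.0)]
single_angle_matches = match_speakers_to_faces(speech, single_angle)
print(single_angle_matches)
```

In the single-angle case every face scores the same for every speaker, so the result is just a tie-break and both speakers land on the same face. That is the failure mode the quoted caveat describes, and why extra cues such as lip movement or stereo position would be needed.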

MetallicCloud over 2 years ago

Very cool. Zack Friedman made something similar to wear on a hoody. It wasn't perfect, but it worked. https://youtu.be/mTK8dIBJIqg

EchoReflection over 2 years ago

So I've taken to saving many, many posts from HN with https://web.archive.org/save/ and this is the amusing result I got when trying to save this one: "This host has been already captured 100,091.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more." Haha.

vorpalhex over 2 years ago

This would be a killer app for a Google Glass-type device: a near-realtime closed captioning device with translation.

navanchauhan over 2 years ago

Pretty darn cool! I almost want to try to implement this in OpenCV.js and see if it is possible to build a browser extension that does this in the browser itself.

bee_rider over 2 years ago

How easy is it to translate word by word like that? I was under the impression that generally it is hard because different languages have different word orders. Is it not necessary to have the whole sentence before starting? (Or maybe Polish is just conveniently similar to English in word order?)

Mizza over 2 years ago

I'm really hoping somebody comes along and puts this in a format that I can attach to my head inconspicuously.

sj8822 over 2 years ago

This is utterly incredible. I have so many ideas after reading way too much sci-fi and watching Ridley Scott’s sci-fi series, Raised By Wolves (what if we could create a benevolent, kind and caring AI to help humans grow and navigate this world, like Father in the series?).

I want to jump in, badly. How would one go about picking up the skills needed to create stuff like this? As pragmatically, concretely and efficiently as possible, without getting sidetracked by overly theoretical distractions?

I’m a fullstack engineer and have an MS in CS and pretty good math chops, but I sadly only took one machine learning course in all of my formal education.

How do I get into this (GPT-3 and ChatGPT are also on my mind)? Please, any books, MOOCs, etc.

slowmotiony over 2 years ago

Is it possible to run this tool locally and use it for myself? I don't see any instructions in the README; all I see are ways to set it up for development and ways to deploy it to some 3rd party cloud solutions.

warangal over 2 years ago

Videos encode so much information, and it looks pretty cool when such projects extract higher-level information to play with. Recent models like Whisper and CLIP are amazing at helping make sense of that information, even for personal projects.

We are also trying to do something similar [0], but there is still a lot of work remaining. The idea is to allow real-time processing of any video using just a CPU.

[0] https://www.youtube.com/watch?v=E7UPj9blnWc

chinchilla2020 over 2 years ago

This is impressive and so is the writeup. How did you create the diagrams in the readme? I really like that visual format for diagrams.

ArekDymalski over 2 years ago

Now this is both a cool piece of tech and practically useful. Very nice!

jamesblonde over 2 years ago

I developed an online course on serverless machine learning, where you can learn some of the principles of refactoring ML systems into separate feature/training/inference pipelines: https://github.com/featurestoreorg/serverless-ml-course

Some of the students have built similar systems, for example chaining Whisper and ChatGPT, or translation or sentiment analysis of transcribed text, such as here (transcribe Swedish and tell me the sentiment of the text): https://huggingface.co/spaces/Chrysoula/voice_to_text_swedish

ReFruity over 2 years ago

Nice! It would also be cool to see the text positioned above the face and translated, just like in the Cyberpunk GIF.

yDogbone over 2 years ago

I would like to point out: as a newbie coder, I had a ton of fun learning how your code works with the help of ChatGPT. I even learned about unfamiliar topics like serverless apps, NLP transformers, and YAML vs. JSON, lol.

holler over 2 years ago

This is awesome, impressed you threw this together over a weekend!

What did you use to make that entity diagram?

edit: answered below

abidlabs over 2 years ago

Neat use of Gradio & Modal

danso over 2 years ago

Clever hack/solution, and the thorough documentation is appreciated!

jamesblonde over 2 years ago

This is a great example of serverless machine learning with Modal.

fareesh over 2 years ago

Does anyone have a link to something similar for real-time captions (audio only, e.g. Google Meet, Twitter Spaces)? Or is that typically done on the device itself?

daedalus2027 over 2 years ago

It seems there is a niche market in this...

schizo89 over 2 years ago

This is cool. I guess it's possible to do this all in a single multi-task NN.

jacquesm over 2 years ago

Mindblowing that this was so easy to build. Coming soon to a phone near you, and in realtime.

NetOpWibby over 2 years ago

This is incredible
