Serverless Video Transcription inspired by Cyberpunk 2077

774 points, by pierremenard, over 2 years ago

25 comments

piyh, over 2 years ago

People complain that the magic of programming has been lost because all we do is stitch together APIs. They're oversimplifying the work done. Those people are cynical; this is amazing.

totetsu, over 2 years ago

> Matching faces to voices relies on simple co-occurrence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)

This seems like the really hard part. Maybe there is a way to find the times when the lips move for a given face, or to guess the gender and age of both face and voice. Or, if the audio is a stereo mix, to use relative position.
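
For readers unfamiliar with the idea, here is a minimal sketch of what a co-occurrence heuristic like the one quoted above could look like. It is not the project's actual code: the function names, the `(id, start, end)` tuple layout, and the sample timings are all assumptions made for illustration. Each speaker is assigned the face that is on screen for the longest total time while that speaker is talking.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def match_speakers_to_faces(speech_segments, face_tracks):
    """Assign each speaker the face that shares the most screen time with them.

    speech_segments: list of (speaker_id, start, end), hypothetical diarization output
    face_tracks:     list of (face_id, start, end), hypothetical face-tracker output
    """
    scores = defaultdict(lambda: defaultdict(float))
    for speaker, s_start, s_end in speech_segments:
        for face, f_start, f_end in face_tracks:
            shared = overlap(s_start, s_end, f_start, f_end)
            if shared > 0:
                scores[speaker][face] += shared
    # Pick the face with the largest accumulated overlap; ties go to the
    # face that was seen first, which is arbitrary.
    return {speaker: max(faces, key=faces.get) for speaker, faces in scores.items()}


# Made-up timings: the camera cuts to whoever is speaking.
speech = [("spk0", 0.0, 4.0), ("spk1", 4.0, 9.0), ("spk0", 9.0, 12.0)]
faces = [("faceA", 0.0, 4.5), ("faceB", 4.0, 9.0), ("faceA", 9.0, 12.0)]
print(match_speakers_to_faces(speech, faces))
```

When both faces stay in frame for the whole clip (the single-angle case), every speaker overlaps every face equally, so the scores tie and the assignment becomes arbitrary. That is the failure mode the quoted limitation describes, and it is why a lip-movement or stereo-position signal would be needed to break the tie.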

MetallicCloud, over 2 years ago

Very cool. Zack Friedman made something similar to wear on a hoody. It wasn't perfect, but it worked. https://youtu.be/mTK8dIBJIqg

EchoReflection, over 2 years ago

So I've taken to saving many, many posts from HN with https://web.archive.org/save/ and this is the amusing result I got when trying to save this one: "This host has been already captured 100,091.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more." Haha.

vorpalhex, over 2 years ago

This would be a killer app for a Google Glass-type device: a near-realtime closed captioning device with translation.

navanchauhan, over 2 years ago

Pretty darn cool!

I almost want to try to implement this in OpenCV.js and see if it is possible to build a browser extension to do this in the browser itself.

bee_rider, over 2 years ago

How easy is it to translate word by word like that? I was under the impression that generally it is hard because different languages have different word orders. Is it not necessary to have the whole sentence before starting? (Or maybe Polish is just conveniently similar to English in word order?)

Mizza, over 2 years ago

I'm really hoping somebody comes along and puts this in a format that I can attach to my head inconspicuously.

sj8822, over 2 years ago

This is utterly incredible. I have so many ideas after reading way too much sci-fi and watching Ridley Scott’s sci-fi series, Raised By Wolves (what if we could create a benevolent, kind and caring AI to help humans grow and navigate this world, like Father in the series?).

I want to jump in, badly. How would one go about picking up the skills needed to create stuff like this, as pragmatically, concretely and efficiently as possible, without getting sidetracked by overly theoretical distractions?

I’m a fullstack engineer and have an MS in CS and pretty good math chops, but I sadly only took one machine learning course in all of my formal education.

How do I get into this (GPT-3 and ChatGPT are also on my mind)? Please, any books, MOOCs, etc.

slowmotiony, over 2 years ago

Is it possible to have this tool run locally and use it for myself? I don't see any instructions on the Readme, all I see are ways to set it up for development and ways to deploy it to some 3rd party cloud solutions.

warangal, over 2 years ago

Videos encode so much information, and it looks pretty cool when such projects extract higher-level information to play with. And recent models like Whisper and CLIP are amazing for making sense of that information, even for personal projects.

We are also trying to do something similar [0], but there is still a lot of work remaining. The idea is to allow real-time processing of any video using just a CPU.

[0] https://www.youtube.com/watch?v=E7UPj9blnWc

chinchilla2020, over 2 years ago

This is impressive and so is the writeup. How did you create the diagrams in the readme? I really like that visual format for diagrams.

ArekDymalski, over 2 years ago

Now this is both a cool piece of tech and practically useful. Very nice!

jamesblonde, over 2 years ago

I developed an online course on serverless machine learning, where you can learn some of the principles of refactoring ML systems into separate feature/training/inference pipelines: https://github.com/featurestoreorg/serverless-ml-course

Some of the students have built similar systems, for example chaining Whisper with ChatGPT, translation, or sentiment analysis of the transcribed text, such as here (transcribe Swedish and tell me the sentiment of the text): https://huggingface.co/spaces/Chrysoula/voice_to_text_swedish

ReFruity, over 2 years ago

Nice! It would also be cool to see the text positioned above the face and translated, just like in the Cyberpunk GIF.

yDogbone, over 2 years ago

I would like to point out: as a newbie coder, I had a ton of fun learning how your code works with the help of ChatGPT, even learning about unfamiliar topics like serverless apps, NLP transformers, or YAML vs JSON lol

holler, over 2 years ago

This is awesome, impressed you threw this together over a weekend!

What did you use to make that entity diagram?

edit: answered below

abidlabs, over 2 years ago

Neat use of Gradio & Modal

danso, over 2 years ago

Clever hack/solution, and the thorough documentation is appreciated!

jamesblonde, over 2 years ago

This is a great example of serverless machine learning with Modal.

fareesh, over 2 years ago

Does anyone have a link to something similar for real-time captions (audio only, e.g. Google Meet, Twitter Spaces)? Or is that typically done on the device itself?

daedalus2027, over 2 years ago

It seems there is a niche market in this...

schizo89, over 2 years ago

This is cool. I guess it's possible to do this all in a single multi-task NN.

jacquesm, over 2 years ago

Mindblowing that this was so easy to build.

Coming soon to a phone near you, and in realtime.

NetOpWibby, over 2 years ago

This is incredible
