Serverless Video Transcription inspired by Cyberpunk 2077

774 points by pierremenard over 2 years ago

25 comments

piyh over 2 years ago
People complain that the magic of programming has been lost because all we do is stitch together APIs. They're oversimplifying the work done; those people are cynical. This is amazing.
totetsu over 2 years ago
> Matching faces to voices relies on simple co-occurrence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)

This seems like the really hard part. Maybe there is a way to find the times when the lips move for a face, or to guess the gender and age of both face and voice, or, if the audio is a stereo mix, to use relative position.
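
For readers wondering what a co-occurrence heuristic like the one quoted above might look like, here is a minimal Python sketch. It is not the project's actual code: the function name `match_speakers_to_faces`, the input shapes (speaker segments with start/end times, face sightings with timestamps), and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def match_speakers_to_faces(speech_segments, face_sightings):
    """Assign each speaker the face seen most often while they talk.

    speech_segments: list of (speaker_id, start_s, end_s) tuples (hypothetical format)
    face_sightings:  list of (face_id, timestamp_s) tuples (hypothetical format)
    Returns a dict mapping speaker_id -> face_id.
    """
    counts = defaultdict(Counter)
    for speaker, start, end in speech_segments:
        for face, t in face_sightings:
            # Count a co-occurrence when the face is on screen during the segment.
            if start <= t < end:
                counts[speaker][face] += 1
    # Pick the most frequently co-occurring face for each speaker.
    return {speaker: c.most_common(1)[0][0] for speaker, c in counts.items()}


# Made-up data: the camera mostly shows whoever is speaking.
segments = [("A", 0.0, 5.0), ("B", 5.0, 10.0)]
sightings = [("f1", 1.0), ("f1", 2.0), ("f2", 3.0),
             ("f2", 6.0), ("f2", 7.0), ("f1", 8.0)]
print(match_speakers_to_faces(segments, sightings))
```

If both faces are visible for the whole conversation, as in the single-angle case, every speaker sees the same counts for every face and the choice becomes arbitrary, which is why extra cues such as lip movement or stereo position would be needed.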
MetallicCloud over 2 years ago
Very cool. Zack Friedman made something similar to wear on a hoody. It wasn't perfect, but it worked. https://youtu.be/mTK8dIBJIqg
EchoReflection over 2 years ago
So I've taken to saving many, many posts from HN with https://web.archive.org/save/ and this is the amusing result I got when trying to save this one: "This host has been already captured 100,091.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more." Haha.
vorpalhex over 2 years ago
This would be a killer app for a Google Glass-type device: a near-realtime closed captioning device with translation.
navanchauhan over 2 years ago
Pretty darn cool!

I almost want to try to implement this in OpenCV.js and see if it is possible to build a browser extension that does this in the browser itself.
bee_rider over 2 years ago
How easy is it to translate word by word like that? I was under the impression that generally it is hard because different languages have different word orders. Is it not necessary to have the whole sentence before starting? (Or maybe Polish is just conveniently similar to English in word order?)
Mizza over 2 years ago
I'm really hoping somebody comes along and puts this in a format that I can attach to my head inconspicuously.
sj8822 over 2 years ago
This is utterly incredible. I have so many ideas after reading way too much sci-fi and watching Ridley Scott's sci-fi series, Raised By Wolves (what if we could create a benevolent, kind and caring AI to help humans grow and navigate this world, like Father in the series?).

I want to jump in, badly. How would one go about picking up the skills needed to create stuff like this? As pragmatically, concretely and efficiently as possible, without getting sidetracked by overly theoretical distractions?

I'm a fullstack engineer and have an MS in CS and pretty good math chops, but I sadly only took one machine learning course in all of my formal education.

How do I get into this (GPT-3 and ChatGPT are also on my mind)? Please, any books, MOOCs, etc.
slowmotiony over 2 years ago
Is it possible to have this tool run locally and use it for myself? I don't see any instructions on the Readme, all I see are ways to set it up for development and ways to deploy it to some 3rd party cloud solutions.
warangal over 2 years ago
Videos encode so much information, and it looks pretty cool when such projects extract higher-level information to play with. Recent models like Whisper and CLIP are amazing for making sense of that information, even for personal projects.

We are also trying to do something similar [0], but there is still a lot of work remaining. The idea is to allow real-time processing of any video using just a CPU.

[0] https://www.youtube.com/watch?v=E7UPj9blnWc
chinchilla2020 over 2 years ago
This is impressive and so is the writeup. How did you create the diagrams in the readme? I really like that visual format for diagrams.
ArekDymalski over 2 years ago
Now this is both a cool piece of tech and practically useful. Very nice!
jamesblonde over 2 years ago
I developed an online course on serverless machine learning, where you can learn some of the principles of refactoring ML systems into separate feature/training/inference pipelines: https://github.com/featurestoreorg/serverless-ml-course

Some of the students have built similar systems, for example chaining Whisper and ChatGPT or translation or sentiment analysis of transcribed text, such as here (transcribe Swedish and tell me the sentiment of the text): https://huggingface.co/spaces/Chrysoula/voice_to_text_swedish
ReFruity over 2 years ago
Nice! It would also be cool to see the text positioned above the face and translated, just like in the Cyberpunk GIF.
yDogbone over 2 years ago
I would like to point out: as a newbie coder, I had a ton of fun learning how your code works with the help of ChatGPT, even learning about unfamiliar topics like serverless apps, NLP transformers, or YAML vs. JSON lol
holler over 2 years ago
This is awesome, impressed you threw this together over a weekend!

What did you use to make that entity diagram?

edit: answered below
abidlabs over 2 years ago
Neat use of Gradio & Modal
danso over 2 years ago
Clever hack/solution, and the thorough documentation is appreciated!
jamesblonde over 2 years ago
This is a great example of serverless machine learning with Modal.
fareesh over 2 years ago
Does anyone have a link to something similar for real-time captions (audio only, e.g. Google Meet, Twitter Spaces)? Or is that typically done on the device itself?
daedalus2027 over 2 years ago
It seems there is a niche market in this...
schizo89 over 2 years ago
This is cool. I guess it's possible to do this all in a single multi-task NN.
jacquesm over 2 years ago
Mindblowing that this was so easy to build.

Coming soon to a phone near you, and in realtime.
NetOpWibby over 2 years ago
This is incredible