Heh, funny to see this pop up here :)

The performance on Apple Silicon should be much better today than what is shown in the video, as whisper.cpp now runs fully on the GPU and llama.cpp generation speed has improved significantly over the last few months.
This is cool. I hooked up Llama to an open-source TTS model for a recent project and there was a lot of fun engineering that went into it.

On a different note:

I think the most useful coding copilot tools for me reduce "manual overhead" without attempting to do any hard thinking/problem solving for me (such as generating arguments and types from docstrings or vice versa, etc.). For more complicated tasks you really have to give the copilot a pretty good "starting point".

I often talk to myself while coding. It would be extremely, extremely futuristic (and potentially useful) if a tool like this embedded my speech into a context vector and used it as an additional copilot input, so the model has a better "starting point".

I'm a late adopter of copilot and don't use it all the time, but if anyone is aware of anything like this I'd be curious to hear about it.
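To make the idea concrete, here is roughly what I imagine the transcription side looking like, using whisper.cpp's C API; the prompt assembly and the audio capture are hypothetical stubs of mine, not anything that exists today:

```cpp
// Hypothetical sketch: transcribe spoken "thinking out loud" with whisper.cpp
// and prepend it to whatever prompt the copilot backend receives.
// Audio capture and the actual copilot call are out of scope here.
#include <string>
#include <vector>
#include "whisper.h"

// Transcribe 16 kHz mono float PCM into text using whisper.cpp.
std::string transcribe_commentary(struct whisper_context * ctx,
                                  const std::vector<float> & pcm16khz) {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_progress = false;

    if (whisper_full(ctx, params, pcm16khz.data(), (int) pcm16khz.size()) != 0) {
        return "";
    }

    std::string text;
    const int n = whisper_full_n_segments(ctx);
    for (int i = 0; i < n; ++i) {
        text += whisper_full_get_segment_text(ctx, i);
    }
    return text;
}

// Build the "starting point" for the copilot: spoken context plus the code
// around the cursor. How the copilot actually consumes this is the open part.
std::string build_prompt(const std::string & spoken_context,
                         const std::string & code_context) {
    return "// Developer commentary (spoken):\n// " + spoken_context +
           "\n\n" + code_context;
}
```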
I'm getting a "floating point exception" when running ./talk-llama on arch and debian. Already checked sdl2lib and ffmpeg (because of this issue: <a href="https://github.com/ggerganov/whisper.cpp/issues/1325">https://github.com/ggerganov/whisper.cpp/issues/1325</a>) but nothing seems to fix it. Anyone else?
Aren't there text-to-speech solutions that can receive a stream of text, so one doesn't have to wait for llama to finish generating before the answer is spoken?

I guess it would only work if the model can keep the buffer filled fast enough that the TTS engine doesn't stall.
Would it be possible to reduce lag by streaming groups of ~6 tokens at a time to the TTS as they're generated, instead of waiting for the full LLM response before beginning to speak it?
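Something like the chunking below is what I have in mind. It's only a rough sketch of the buffering logic, with speak() standing in for whatever TTS call is actually available, not talk-llama's real code:

```cpp
// Rough sketch: accumulate streamed LLM tokens and hand them to the TTS in
// small chunks instead of waiting for the full response.
// speak() is a stand-in for whatever TTS API is actually used.
#include <functional>
#include <string>

class TtsStreamer {
public:
    explicit TtsStreamer(std::function<void(const std::string &)> speak,
                         size_t min_tokens = 6)
        : speak_(std::move(speak)), min_tokens_(min_tokens) {}

    // Called for every token the LLM emits.
    void on_token(const std::string & token) {
        buffer_ += token;
        ++count_;
        // Flush on sentence-ish boundaries, or once enough tokens have piled
        // up and we are at a word boundary, so the TTS never gets half a word.
        const bool sentence_end = !token.empty() &&
            (token.back() == '.' || token.back() == '!' ||
             token.back() == '?' || token.back() == ',');
        const bool word_boundary = !token.empty() && token.back() == ' ';
        if (sentence_end || (count_ >= min_tokens_ && word_boundary)) {
            flush();
        }
    }

    // Called once the LLM is done, to speak whatever is left over.
    void finish() { flush(); }

private:
    void flush() {
        if (!buffer_.empty()) {
            speak_(buffer_);
            buffer_.clear();
            count_ = 0;
        }
    }

    std::function<void(const std::string &)> speak_;
    size_t min_tokens_;
    std::string buffer_;
    size_t count_ = 0;
};
```

As the comment above notes, this only helps if the model keeps producing chunks faster than the TTS consumes them; otherwise you just trade one long pause for several short ones.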
This makes me wonder: what's the equivalent of ollama for whisper / SOTA open-source TTS models? I'm really happy with ollama for running open-source LLMs locally, but I don't know of any project that makes it *that* simple to set up whisper locally.
Could anyone explain the capability of this in plain English? Can it learn, retain the context of a chat, and build some kind of long-term memory? Thanks
Does anybody have a quick start for building all of this on Windows?
I could probably check it out as a VS project and build it, but I'm going to bet that, since it isn't documented, it will have issues, specifically because the Linux build instructions are the only ones treated as a first-class citizen...
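For what it's worth, the repo ships a CMakeLists.txt, so something like the following from a VS developer prompt should get close. I haven't tried it on Windows myself; the WHISPER_SDL2 option comes from the Linux instructions for the SDL-based examples, and the SDL2 path is a placeholder:

```
:: untested sketch (assumes CMake, the MSVC toolchain, and an SDL2
:: development package unpacked somewhere; adjust SDL2_DIR accordingly)
cmake -S . -B build -DWHISPER_SDL2=ON -DSDL2_DIR="C:/path/to/SDL2/cmake"
cmake --build build --config Release
```

If it works, the example binaries should end up somewhere under build/bin (in the Release subdirectory for a multi-config generator), but no guarantees that talk-llama builds cleanly with MSVC.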
What are currently the best/go-to approaches to detect the end of an utterance? This can be tricky even in conversations between humans, requiring semantic information about what the other person is saying. I wonder if there’s any automated strategy that works well enough.
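The common baseline seems to be purely acoustic: treat a sustained drop in energy as the end of the utterance (I believe the whisper.cpp stream/talk examples use a simple heuristic along these lines). A rough sketch of that idea, with all thresholds being assumptions to tune rather than anything canonical:

```cpp
// Naive energy-based endpointing: declare end-of-utterance after the signal
// stays below a threshold for a fixed amount of time. Threshold values are
// made up and would need tuning for a real microphone setup.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

class EnergyEndpointer {
public:
    EnergyEndpointer(int sample_rate, float silence_ms = 800.0f,
                     float rms_threshold = 0.01f)
        : samples_needed_(static_cast<size_t>(sample_rate * silence_ms / 1000.0f)),
          rms_threshold_(rms_threshold) {}

    // Feed one frame of mono PCM; returns true once enough trailing silence
    // has accumulated to call the utterance finished.
    bool feed(const std::vector<float> & frame) {
        float energy = 0.0f;
        for (float s : frame) energy += s * s;
        const float rms = std::sqrt(energy / std::max<size_t>(frame.size(), 1));

        if (rms < rms_threshold_) {
            silent_samples_ += frame.size();
        } else {
            silent_samples_ = 0;  // speech resumed, reset the countdown
        }
        return silent_samples_ >= samples_needed_;
    }

private:
    size_t samples_needed_;
    float rms_threshold_;
    size_t silent_samples_ = 0;
};
```

Anything smarter, such as prosody cues or a semantic "is this sentence complete?" check on the partial transcript, has to sit on top of a heuristic like this, which is exactly the hard part the comment above is asking about.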
Very sick demo! If anyone wants to work on packaging this up for broader (SwiftUI/macOS) consumption, I just added an issue: https://github.com/psugihara/FreeChat/issues/30