
Launch HN: Deepgram (YC W16) – Scalable Speech API for Businesses

56 points by stephensonsco over 6 years ago
Hey HN,

I'm Scott Stephenson, one of the cofounders of Deepgram (https://www.deepgram.com/). Getting information from recorded phone calls and meetings is time-intensive, costly, and imprecise. Our speech recognition API allows businesses to reliably translate high-value unstructured audio into accurate, parsable data.

Deepgram started when my cofounder Noah Shutty and I had just finished looking for dark matter (while in a particle physics lab at the University of Michigan). Noah had the idea to start recording all audio from his life, 24/7. After gathering hundreds of hours of recordings, we wanted to search inside this fresh dataset, but realized there wasn't a good way to find specific moments. So we built a tool using the same AI techniques we had used for finding dark matter particle events, and it ended up working pretty well. A few months later, we made a single-page demo to show off "searching through sound" and posted it to HN. Pretty soon we were in the winter 2016 batch of YC (https://techcrunch.com/2016/09/27/launching-a-google-for-sound-deepgram-raises-1-8-million/).

I'd say we didn't know what we were getting ourselves into. Speech is a really big problem with a huge market, but it's also a tough nut to crack. For decades, companies have been unable to extract real insight from their massive amounts of recorded audio (some companies record more than 1,000,000 minutes of call center calls every single day). They record the audio for a few reasons: some for compliance, some for training, and some for market research.

The questions they're trying to answer are usually as simple as:

- "What is the topic of the call?"
- "Is this call compliant?" (did I say: my company name, my name, and "this call may be recorded")
- "Are people getting their problems solved quickly?"
- "Do my agents need training?"
- "What are our customers talking about? Competitors? Our latest marketing campaign?"

It's the most intimate view you can get of your customers, but the problem is so large and difficult that companies have pushed it into the corner over the past couple of decades, only trying to stanch the bleeding. Current tools transcribe real-world, noisy, accented, industry-specific audio with only around 50-60% accuracy (don't believe the "human level accuracy" hype). When companies start solving problems with speech data, they first want transcription that's accurate. After accuracy comes scale, another big problem. Speech processing is computationally expensive and slow. Imagine trying to get into an iterative problem-solving loop when you have to wait 24 hours to get your transcripts back.

So we've set our sights on building *the* speech company. Competition from companies like Google, Amazon, and Nuance is real, but none of them approach speech recognition the way we do. We've rebuilt the entire speech processing stack, replacing heuristics and stats-based speech processing with fully end-to-end deep learning (we use CNNs and RNNs). Using GPUs, we train speech models to learn each customer's unique vocabulary, accents, product names, and acoustic environments. This can be the difference between correctly capturing "wasn't delivered" and "was in the liver." We've focused on speed, since we think that's very important for exploration and scale. Our API returns hour-long transcripts interactively in seconds. It's a tool many businesses wish they had.

So far we've released tools that:

- transcribe speech with timestamps
- support real-time streaming
- have multi-channel support
- understand multiple languages (in beta now)
- allow you to deeply search for keywords and phrases
- transcribe to phonemes
- get more accurate with use

Some of those are better mousetraps for things you're familiar with, and some are completely new levers to pull on your audio data. We built the core on English, but now we're releasing the tools for all of the Americas. (Aside: you can transfer-learn speech, and it works well!)

Accuracy will continue to improve for transcription, but I think we can do more. It's such a large problem, and we really want to make a dent in "solving speech". That means truly asking: "What can a human do?"

People can, with little context, jump into a conversation and determine:

- What are the words? When are they said? Who said what?
- Is this person young/old? Male/female? Exhausted/energetic?
- Where is there confusion?
- What language are they speaking? What's the speaker's accent?
- What's the topic of the conversation? Small talk or real? Is it going well?

Some of those things are being worked on now: additional language support, language and accent detection, sentiment analysis, auto-summarization, topic modeling, and more.

We'd love to hear your feedback and ideas.
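[Editor's note: to make the compliance use case concrete ("did the agent say 'this call may be recorded'?"), here is a minimal sketch of scanning word-level timestamps for a phrase. The transcript structure below is a hypothetical shape chosen for illustration, not Deepgram's actual response schema.]

```python
# Hypothetical word-level transcript: each entry is a word with start/end
# times in seconds. This shape is an assumption for illustration only.
transcript = [
    {"word": "this", "start": 0.0, "end": 0.2},
    {"word": "call", "start": 0.2, "end": 0.5},
    {"word": "may", "start": 0.5, "end": 0.7},
    {"word": "be", "start": 0.7, "end": 0.8},
    {"word": "recorded", "start": 0.8, "end": 1.3},
]

def find_phrase(words, phrase):
    """Return (start, end) time spans where the phrase occurs verbatim,
    by sliding a window of the phrase's length over the word list."""
    target = phrase.lower().split()
    hits = []
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        if [w["word"].lower() for w in window] == target:
            hits.append((window[0]["start"], window[-1]["end"]))
    return hits

print(find_phrase(transcript, "may be recorded"))  # → [(0.5, 1.3)]
```

With word-level timestamps, a compliance check reduces to a text search that also tells you *where* in the call the disclosure happened, so a reviewer can jump straight to that moment in the audio.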

7 comments

btown over 6 years ago
(FYI your https://deepgram.com/v2/docs links are giving "error": "Not Found" JSON responses.)

I love progress in this space. Something I also think is necessary, though, is innovation in the discoverability interfaces around speech data. Can you search over potential transcriptions weighted by their likelihood, rather than just doing full-text search on the most-likely transcriptions? Can you visualize multiple potential transcriptions inline without overloading someone's visual cortex with information? Can you one-click-to-listen to any specific line? Can you enable people to switch conversations on the fly to an "off-the-record" mode, with such confidence that the default can be that every conversation is highlighted? Can you do all of this from Slack? Can you make setup a one-click process with Twilio OAuth? Can you do all of this from a web app that requires no coding?

All this, I'm sure, is part of an ecosystem that will be built on tools like yours, and that ecosystem fundamentally depends on the quality of the data - so it makes sense for you all to focus there first. But to the extent you want to capture the entire "stack," there's a tremendous space for someone to take that level of "passion" for data quality and apply the same instinct to quality-of-experience.
trevyn over 6 years ago
> Noah had the idea to start recording all audio from his life, 24/7

Want this as a product. :)
vitovito over 6 years ago
Do you plan to offer something around one-shot machine transcription with offline/on-prem search?

I have ~200k hours of legacy audio I'd love to be able to do a fuzzy (phonetic?) search on, to pull content from and get real (human-edited) transcriptions of important stuff to resurface it, but there's not a lot of incentive to push it through a service for a quarter million dollars and then also pay to store and search it, since we're currently doing without it. Doing it at extremely low priority, delivered over a long span of time, for an order of magnitude cheaper, with our IT standing up some stock fuzzy search engine, is a pretty easy sell, though.
dumbfoundded over 6 years ago
Hi! Thanks for sharing; I have a few questions.

- How does your WER compare to other engines? https://medium.com/descript/which-automatic-transcription-service-is-the-most-accurate-2018-2e859b23ed19
- How do you gather data?
- Where do you see your long-term differentiation? Is it the features you build on top of other engines, or the engine itself?

Disclaimer: I led engineering for temi.com (a competitor of yours) but am no longer affiliated with it.
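[Editor's note: for context on the WER question above — word error rate is the standard ASR accuracy metric: the Levenshtein (edit) distance between the hypothesis and reference word sequences, divided by the reference length. A minimal sketch, not any vendor's implementation:]

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference
    word count, via dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# The OP's own example pair scores a WER of 1.0 (every reference word
# after "my package" is substituted or padded with insertions):
print(wer("my package wasn't delivered", "my package was in the liver"))  # → 1.0
```

Note that WER can exceed 1.0 when the hypothesis has many insertions, which is one reason published head-to-head comparisons also need matched test audio to be meaningful.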
jaredwiener over 6 years ago
This looks interesting. Curious what the pricing is? I don't see it on the website.
ivankirigin over 6 years ago
With multiple speakers, can you identify who is speaking?

If you were in a conference room with multiple threads of conversation, could you tease out all of them?
pouta over 6 years ago
How does this compare to Trint in terms of speech recognition performance?