That mostly depends on what scale you have in mind.

If you simply want to make an app for your own personal use, and you imagine a restricted form of dialog (by which I mean e.g. "query/reply" or "command"-type dialogs as opposed to open discussions) that triggers a limited set of actions (say, verbatim web searches, controlling the built-in functionality of a smartphone, etc.), it is feasible.

That doesn't mean it's easy. But for a project to hack on, why not?

The good news is that there are a lot of tools that can do some of the heavy lifting for you, especially if you restrict yourself to English. You are right that the situation for other languages is not quite as luxurious, but there are tools (of varying quality) for other languages as well, especially Western European languages.

However, because it's a complex subject matter, expect that you might need to first dig into some linguistic and/or NLP theory in order to get the most out of these tools.

For instance, the Kaldi speech recognition toolkit is state-of-the-art research software for automatic speech recognition (ASR), and it's open source. The thing is, to get really good recognition results, you might need to train your own acoustic and language models. Hence, you'd need to learn about these things.

For NLU (natural language understanding) there are also a bunch of free software packages available; however, they often follow completely different philosophies and goals. Thus, in order to make an informed decision about which one would be best for you, you'd again have to be prepared to do some reading.

One quite user-friendly service for NLU you might want to check out is wit.ai, which was acquired by Facebook last year. They focus on setting the entrance barrier really low for the task of turning spoken input into a domain representation. For example, you can quite easily define rules that turn the utterance "turn down the radio please" into a symbolic representation that you can use in your downstream processing. The big plus here is that they do the ASR for you, so you don't have to worry about that.

If you prefer to have more control over your tool chain, there is a wide variety of scripting languages that you can use to get your feet wet. AIML is sort of popular for writing bots, but it's quite limited and you have to write rules in XML. VoiceXML is a standard that is great for form-filling applications, i.e., situations where your system needs to elicit a specific set of information that's required to run a task. A classic example would be travel: for your system to find a flight for you, it needs to know (a) point of departure, (b) destination, (c) preferred date and time (perhaps others). So either you tell the system this information up front, or it has to ask for it.

There are also domain-specific languages like Platon (https://github.com/uds-lsv/platon) that, again, give you more control but also try to make it quite easy to write a simple application.

The next aspect more complex dialog systems typically care about is the intent of a specific user utterance. Say you ask your personal assistant "do you know when the next bus comes?"; you don't want it to answer "yes". That's because your (what is called) "dialog act" was not a yes/no question, but a request for information. So, you might want to care about how to detect the correct dialog act.
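To make the "symbolic representation" and "dialog act" ideas a bit more concrete, here's a minimal sketch in plain Python. It doesn't use wit.ai, Kaldi, or any real NLU library; the frame fields, rules, and act labels are all made up for illustration, just to show the kind of text-to-structure mapping such tools give you:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A toy symbolic representation of one user utterance."""
    dialog_act: str                  # e.g. "command", "wh_question", "request_info"
    intent: str = "unknown"
    slots: dict = field(default_factory=dict)

# Hand-written patterns -- the kind of rules a service like wit.ai
# would otherwise let you author (or learn) in its own format.
RULES = [
    (re.compile(r"\bturn (up|down) the (\w+)"), "change_volume",
     lambda m: {"direction": m.group(1), "device": m.group(2)}),
    (re.compile(r"\bwhen .*\b(bus|train)\b"), "next_departure",
     lambda m: {"vehicle": m.group(1)}),
]

def classify_dialog_act(text: str) -> str:
    """Very rough dialog-act detection based on surface cues only."""
    t = text.lower().strip()
    if t.startswith(("do you", "is there", "can you", "are there")):
        return "yn_question_form"    # *looks* like yes/no; may be reinterpreted below
    if t.startswith(("what", "when", "where", "who", "how")):
        return "wh_question"
    return "command"

def understand(text: str) -> Frame:
    frame = Frame(dialog_act=classify_dialog_act(text))
    for pattern, intent, extract in RULES:
        m = pattern.search(text.lower())
        if m:
            frame.intent = intent
            frame.slots = extract(m)
            break
    # "do you know when the next bus comes?" is a yes/no question on the
    # surface, but if it carries a request intent, treat it as a request.
    if frame.dialog_act == "yn_question_form" and frame.intent != "unknown":
        frame.dialog_act = "request_info"
    return frame

print(understand("turn down the radio please"))
print(understand("do you know when the next bus comes?"))
```

A real NLU component would of course be far more robust than two regular expressions, but the output shape (an act label, an intent, and slots) is roughly what your downstream logic ends up consuming either way.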
Well, first you'd want to know what kinds of dialog acts there are and which of those your system should be prepared to handle. There are many different dialog act sets developed for different domains and situations. There's also an ISO standard (ISO 24617-2) that defines such a set, but then you'd be going into more advanced areas again.

Next, say your system has done all of the above processing, recognized speech, analyzed the meaning, etc. -- now your system has to make the next move! So how does it decide what the best reaction is? What some consider the state of the art for dialog management these days runs under the label POMDP -- Partially Observable Markov Decision Processes. These are systems that learn the best strategy for how to behave from data, typically using reinforcement learning. But you still have the more traditional approaches, too, in which an "expert" (in this case: you) authors the dialog behavior, and there are tools for that as well.

But again, the simpler languages mentioned above, e.g. Platon etc., also cover this in a way, so don't get discouraged just because you've never heard of POMDPs before, nor have the large data set required for the machine learning part: as with all of the different tasks here, there are always alternatives.

Once your assistant has made up its mind about what to do and what to say, you need to turn that into an actual utterance, right? If you just want to start, having a largish set of canned sentences that you simply select from can get you a long way. The next step would be to insert some variables into those canned sentences that your system can fill depending on the situation. That's called template-based natural language generation (NLG); there's a toy sketch of this, together with a hand-authored dialog policy, at the end of this comment. More recently, machine learning has also been applied with some success to NLG, but that's (a) still researchy and (b) not even necessary for a first stab at writing a dialog system.

Unless you just want to display the system utterance on the screen, you'd finally need a text-to-speech (TTS) component to vocalize the system utterance. There are some free options, such as Festival or MaryTTS, but unfortunately, they don't quite reach the quality of commercial solutions yet. But hey, who cares, right?

One topic I haven't talked about at all yet is uncertainty. Typically, a lot of the steps on the input side of a dialog system use probabilistic approaches because, starting from the audio signal, there's inevitably noise in the input, and so the outputs produced on the input side should always be taken with a grain of salt. For ASR, you can often get not just one recognized utterance but a whole list of hypotheses about what the user actually said. Each of these alternatives might come with a confidence score. That, of course, has implications for all the processing that comes afterwards.

Now, I've written a whole lot -- and yet there's so much more I haven't touched on, such as prosody processing, multimodality (e.g., using (touch) gestures together with speech), handling of speech disfluencies, barge-in, etc.

But I think that shouldn't keep you from just giving it a try. You don't have to write a Siri clone in one weekend. Just like the first video game you write doesn't have to be the next "Last of Us", you can start with Pac-Man just fine; likewise, you can write a first small voice-based assistant that cannot do half the stuff Siri can, and still have a great time.
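And since I said you can start small, here's the promised toy sketch of hand-authored dialog management plus template-based NLG for the flight example, again in plain Python. The slot names, templates, and the little loop are all invented for illustration; a real system would feed actual NLU output into the state instead of hard-coded dictionaries:

```python
# Toy form-filling dialog manager + template-based NLG for the flight example.
# All slot names, templates, and the overall structure are invented for illustration.

REQUIRED_SLOTS = ["origin", "destination", "date"]

# Template-based NLG: canned sentences with variables filled in at runtime.
TEMPLATES = {
    "ask_origin":      "Where would you like to fly from?",
    "ask_destination": "And where do you want to go?",
    "ask_date":        "When would you like to travel?",
    "confirm":         "Okay: a flight from {origin} to {destination} on {date}. Shall I search?",
}

def next_system_action(state: dict) -> str:
    """Hand-authored dialog policy: ask for the first missing slot, else confirm.
    (A POMDP-based manager would learn this choice from data instead of
    having it written down like this.)"""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return f"ask_{slot}"
    return "confirm"

def generate(action: str, state: dict) -> str:
    """Template-based NLG: pick a canned sentence and fill in the slots."""
    return TEMPLATES[action].format(**state)

# Simulated exchange: pretend the NLU delivers one slot per user turn.
state = {}
for nlu_result in [{}, {"origin": "Berlin"}, {"destination": "Lisbon"}, {"date": "next Friday"}]:
    state.update(nlu_result)
    action = next_system_action(state)
    print(f"SYSTEM ({action}): {generate(action, state)}")
```

In a real pipeline, each of those NLU results would come with a confidence score, and the policy would also have to decide when to confirm a low-confidence value before acting on it -- which is exactly where the uncertainty handling mentioned above starts to matter.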