That mostly depends on what scale you have in mind.

If you simply want to make an app for your own personal use, and you imagine a restricted form of dialog (by which I mean e.g. "query/reply" or "command"-type dialogs as opposed to open discussions) that triggers a limited set of actions (say, verbatim web searches, controlling the built-in functionality of a smartphone, etc.), it is feasible.

That doesn't mean it's easy. But for a project to hack on, why not?

The good news is that there are a lot of tools that can do some of the heavy lifting for you, especially if you restrict yourself to English. You are right that the situation for other languages is not quite as luxurious, but there are tools (of varying quality) for other languages as well, especially Western European languages.

However, because it's a complex subject matter, expect that you might need to first dig into some linguistic and/or NLP theory in order to get the most out of these tools.

For instance, the Kaldi speech recognition toolkit is state-of-the-art research software for automatic speech recognition (ASR), and it's open source. The thing is, to get really good recognition results, you might need to train your own acoustic and language models. Hence, you'd need to learn about these things.

For NLU (natural language understanding) there are also a bunch of free software packages available; however, they often follow completely different philosophies and goals. Thus, in order to make an informed decision about which one would be best for you, you'd again have to be prepared to do some reading.

One quite user-friendly service for NLU you might want to check out is wit.ai, which was acquired by Facebook last year. They focus on setting the entrance barrier really low for the task of turning spoken input into a domain representation. For example, you can quite easily define rules that turn the utterance "turn down the radio please" into a symbolic representation that you can use in your downstream processing. The big plus here is that they do the ASR for you, so you don't have to worry about that.

If you prefer to have more control over your tool chain, there is a wide variety of scripting languages that you can use to get your feet wet. AIML is sort of popular for writing bots, but it's quite limited and you have to write rules in XML. VoiceXML is a standard that is great for form-filling applications, i.e., situations where your system needs to elicit a specific set of information that's required to run a task. A classic example would be travel: for your system to find a flight for you, it needs to know (a) point of departure, (b) destination, (c) preferred date and time (perhaps others). So either you tell the system this information up front, or it has to ask for it.

There are also domain-specific languages like Platon (https://github.com/uds-lsv/platon) that, again, give you more control but also try to make it quite easy to write a simple application.

The next aspect more complex dialog systems typically care about is the intent of a specific user utterance. Say you ask your personal assistant "do you know when the next bus comes?"; you don't want it to answer "yes". That's because your (what is called) "dialog act" was not a yes/no question, but a request for information. So, you might want to care about how to detect the correct dialog act.
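To make the "symbolic representation" and "dialog act" ideas a bit more concrete, here's a minimal sketch in plain Python. It doesn't use wit.ai, Kaldi, or any real NLU library; the frame fields, rules, and act labels are all made up for illustration, just to show the kind of text-to-structure mapping such tools give you:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A toy symbolic representation of one user utterance."""
    dialog_act: str                  # e.g. "command", "wh_question", "request_info"
    intent: str = "unknown"
    slots: dict = field(default_factory=dict)

# Hand-written patterns -- the kind of rules a service like wit.ai
# would otherwise let you author (or learn) in its own format.
RULES = [
    (re.compile(r"\bturn (up|down) the (\w+)"), "change_volume",
     lambda m: {"direction": m.group(1), "device": m.group(2)}),
    (re.compile(r"\bwhen .*\b(bus|train)\b"), "next_departure",
     lambda m: {"vehicle": m.group(1)}),
]

def classify_dialog_act(text: str) -> str:
    """Very rough dialog-act detection based on surface cues only."""
    t = text.lower().strip()
    if t.startswith(("do you", "is there", "can you", "are there")):
        return "yn_question_form"    # *looks* like yes/no; may be reinterpreted below
    if t.startswith(("what", "when", "where", "who", "how")):
        return "wh_question"
    return "command"

def understand(text: str) -> Frame:
    frame = Frame(dialog_act=classify_dialog_act(text))
    for pattern, intent, extract in RULES:
        m = pattern.search(text.lower())
        if m:
            frame.intent = intent
            frame.slots = extract(m)
            break
    # "do you know when the next bus comes?" is a yes/no question on the
    # surface, but if it carries a request intent, treat it as a request.
    if frame.dialog_act == "yn_question_form" and frame.intent != "unknown":
        frame.dialog_act = "request_info"
    return frame

print(understand("turn down the radio please"))
print(understand("do you know when the next bus comes?"))
```

A real NLU component would of course be far more robust than two regular expressions, but the output shape (an act label, an intent, and slots) is roughly what your downstream logic ends up consuming either way.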
Well, first you'd want to know what kinds of dialog acts there are and which of those your system should be prepared to handle. There are many different dialog act sets developed for different domains and situations. There's also an ISO standard (ISO 24617-2) that defines such a set, but then you'd be going into more advanced areas again.

Next, say your system has done all of the above processing, recognized speech, analyzed the meaning, etc. -- now your system has to make the next move! So how does it decide what the best reaction is? What some consider the state of the art for dialog management these days runs under the label POMDP -- Partially Observable Markov Decision Processes. These are systems that learn the best strategy for how to behave from data, typically using reinforcement learning. But you still have the more traditional approaches, too, in which an "expert" (in this case: you) authors the dialog behavior, and there are tools for that as well.

But again, the simpler languages mentioned above, e.g. Platon etc., also cover this in a way, so don't get discouraged just because you've never heard of POMDPs before, nor have the large data set required for the machine learning part: as with all of the different tasks here, there are always alternatives.

Once your assistant has made up its mind about what to do and what to say, you need to turn that into an actual utterance, right? If you just want to start, having a largish set of canned sentences that you simply select from can get you a long way. The next step would be to insert some variables into those canned sentences that your system can fill depending on the situation. That's called template-based natural language generation (NLG); there's a toy sketch of this, together with a hand-authored dialog policy, at the end of this comment. More recently, machine learning has also been applied with some success to NLG, but that's (a) still researchy and (b) not even necessary for a first stab at writing a dialog system.

Unless you just want to display the system utterance on the screen, you'd finally need a text-to-speech (TTS) component to vocalize the system utterance. There are some free options, such as Festival or MaryTTS, but unfortunately, they don't quite reach the quality of commercial solutions yet. But hey, who cares, right?

One topic I haven't talked about at all yet is uncertainty. Typically, a lot of the steps on the input side of a dialog system use probabilistic approaches because, starting from the audio signal, there's inevitably noise in the input, and so the outputs produced on the input side should always be taken with a grain of salt. For ASR, you can often get not just one recognized utterance but a whole list of hypotheses about what the user actually said. Each of these alternatives might come with a confidence score. That, of course, has implications for all the processing that comes afterwards.

Now, I've written a whole lot -- and yet there's so much more I haven't touched on, such as prosody processing, multimodality (e.g., using (touch) gestures together with speech), handling of speech disfluencies, barge-in, etc.

But I think that shouldn't keep you from just giving it a try. You don't have to write a Siri clone in one weekend. Just like the first video game you write doesn't have to be the next "Last of Us", you can start with Pac-Man just fine; likewise, you can write a first small voice-based assistant that cannot do half the stuff Siri can, and still have a great time.
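And since I said you can start small, here's the promised toy sketch of hand-authored dialog management plus template-based NLG for the flight example, again in plain Python. The slot names, templates, and the little loop are all invented for illustration; a real system would feed actual NLU output into the state instead of hard-coded dictionaries:

```python
# Toy form-filling dialog manager + template-based NLG for the flight example.
# All slot names, templates, and the overall structure are invented for illustration.

REQUIRED_SLOTS = ["origin", "destination", "date"]

# Template-based NLG: canned sentences with variables filled in at runtime.
TEMPLATES = {
    "ask_origin":      "Where would you like to fly from?",
    "ask_destination": "And where do you want to go?",
    "ask_date":        "When would you like to travel?",
    "confirm":         "Okay: a flight from {origin} to {destination} on {date}. Shall I search?",
}

def next_system_action(state: dict) -> str:
    """Hand-authored dialog policy: ask for the first missing slot, else confirm.
    (A POMDP-based manager would learn this choice from data instead of
    having it written down like this.)"""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return f"ask_{slot}"
    return "confirm"

def generate(action: str, state: dict) -> str:
    """Template-based NLG: pick a canned sentence and fill in the slots."""
    return TEMPLATES[action].format(**state)

# Simulated exchange: pretend the NLU delivers one slot per user turn.
state = {}
for nlu_result in [{}, {"origin": "Berlin"}, {"destination": "Lisbon"}, {"date": "next Friday"}]:
    state.update(nlu_result)
    action = next_system_action(state)
    print(f"SYSTEM ({action}): {generate(action, state)}")
```

In a real pipeline, each of those NLU results would come with a confidence score, and the policy would also have to decide when to confirm a low-confidence value before acting on it -- which is exactly where the uncertainty handling mentioned above starts to matter.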