Windows Phone 8: Voice Commands

Peter Kuhn
Posted on Feb 25, 2013
Categories:   Windows Phone

Part of the last article was a detailed look at the possibility of speech recognition from within your app. A logical continuation of this technology and feature is to seek deeper integration with the operating system by using voice commands. Voice commands are a way for you to register certain phrases with the Windows Phone OS that are recognized when the user invokes the built-in voice recognition, without your app being active or even launched. In this article I will explain what it takes to use this feature, and what you can achieve with it.

Voice Commands Explained

Typically, your application will be used for a limited set of operations, with the same actions executed again and again. If you have created a news reader, for example, the most common sequence is probably that users launch the app, navigate to the unread items, and then start reading the newly available content. If your app searches for points of interest in the local area, users will launch it, navigate to the search page, enter their search term, and then look through the results. In all these situations, it can become quite annoying to perform these individual steps over and over again, even if it only requires a handful of taps and swipes. Wouldn't it be great to have shortcuts for this?

One way to create those shortcuts on the phone is to use secondary tiles. These allow the user to pin commonly used pages or functionality of your app to their start screen, which then can be invoked directly, without e.g. going through an application's menu first. Secondary tiles are still quite limited and static though, in the sense that they do not provide a way to pass dynamically changing parameters to your application. This for example is something that can be achieved with voice commands.

Command Structure

As we have learned in the previous part [1], voice recognition is a very complex topic that requires a lot of computational power (which is why it is outsourced to external services), and it also has several implications that depend, for example, on the language the user has set their phone to. To provide better recognition results, a common technique is therefore to provide fixed phrases that the recognizer can understand more accurately. This is also the way voice commands work. To add more flexibility and make the commands less static, those phrases can vary slightly in wording or contain variable placeholders. Let's take a look at a typical command as it can be used for voice commands on Windows Phone, such as "Gorilla Garden and poke 3 gorillas", the sample used throughout this article.


The matching algorithm for voice commands expects an identifier that is used to match commands to a particular app. It's important to make that identifier unique to your app so it cannot be confused with other apps or even built-in voice commands of the phone. Typically, though, you would want to make the identifier your app's title or something similar. The actual phrase is then compared to the phrases defined by the app to determine which command should be invoked. Let's take a look at the decomposition of that phrase itself.

After matching the command to your app, the number of possible matches for the phrase has already narrowed down tremendously (only phrases defined in your app will be recognized). To aid further in the recognition process, in this case "poke … gorillas" is a static, fixed part defined for recognition. The only degrees of freedom for the recognizer now are optional words like "and" as well as truly dynamic values that can take one of a set of values that you can freely define and change over time. In the following paragraphs I'll explain how this translates into the technical details for Windows Phone developers.

Voice Commands on Windows Phone

To enable your app for voice commands, you need to specify the "ID_CAP_SPEECH_RECOGNITION" capability in your manifest, just as explained in the last part [2]. You don't need the microphone capability unless, of course, you also plan on using speech recognition from within your app.

Once set up, you can start with voice commands right away and very easily by adding a new XML file that holds the required definitions. To get you started, an item template for voice command definitions is provided for you.


The content of a file that can process the above sample command looks like this:

<?xml version="1.0" encoding="utf-8"?>
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.0">
  <CommandSet xml:lang="en-US" Name="DefaultCommandSet">
    <CommandPrefix>Gorilla Garden</CommandPrefix>
    <Example>poke 3 gorillas</Example>
    <Command Name="PokeGorilla">
      <Example> poke 3 gorillas </Example>
      <ListenFor> [and] poke {number} gorillas </ListenFor>
      <ListenFor> [and] poke [a] gorilla </ListenFor>
      <Feedback> Poking gorillas... </Feedback>
      <Navigate Target="Gorillas.xaml" />
    </Command>
    <PhraseList Label="number">
      <Item> 1 </Item>
      <Item> 2 </Item>
      <Item> 3 </Item>
    </PhraseList>
  </CommandSet>
</VoiceCommands>

One of the important details is to provide a "Name" attribute for your command sets. This unfortunately is not added by default in the template, so make sure to do that manually. If a command set does not have a name, you cannot programmatically access it, and it is also not accounted for when you try to e.g. read the number of command sets that you have already installed on the phone.

Another detail to note is that each command set is associated with a language, in this case US English. This allows you to specify different commands for each of the languages you expect users to use, which already solves some of the basic problems with speech recognition. Please note that the number of sets for each language is limited to one, so you have to define all commands for a language in one command set element.

The command prefix definition is used to match a spoken command to your app, just as explained above. The "Example" element just below it is used in the UI of the phone when the user starts the voice recognition by holding down the Start button of the phone. You may have noticed that there is a question mark button in that default UI. Tap that button and you will be taken to a help screen that, among other things, lists all the apps installed on the phone that support voice commands. There you can read what is written in the "Example" element of your command set.

When you tap the entry of an app there, you are taken to further details, in particular the list of supported commands for this app. Each command in turn can have an "Example" element in the definition file, to give the user samples of the command's usage.

In addition to the example text that is shown in that UI, each command entry can have the following sub-elements:

  • "ListenFor": this denotes an actual phrase that is listened for. You can add multiple phrases that differ slightly, for example in the definition file above I've added two versions for singular and plural. Make sure that all phrases are meant to invoke the same logical command, and don't mix in phrases that are actually meant to do something different.
  • "Feedback": when the app is invoked as result of the recognized command, the phone reads back to the user what is defined here.
  • "Navigate": this important element determines what page is being navigated to when your app is launched as a result of a voice command recognition. If you leave out the "Target" attribute, then the configured main page is used. However, you can navigate to sub pages directly (as with secondary tiles, for example), which is used here in the example also.

Phrases can contain two special elements to achieve what has been discussed in theory above:

  • You can add words in square brackets to mark them as optional. With the above sample, the command "start Gorilla Garden and poke a gorilla" works just as well as simply saying "Gorilla Garden, poke gorilla".
  • By using curly braces, you can add placeholders which are then matched to so-called phrase lists. These are sets of possible values the user can say in place of the markers. In the above sample, a phrase list "{number}" is used. You can also use a special placeholder "{*}" to allow virtually any spoken value. However, you won't have access to that value later; it will only show up as "…" in the recognized result.

I'll return to the topic of phrase lists below, as they play a vital part in making your commands more dynamic.
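To make the bracket and brace rules more tangible, here is a small, self-contained sketch that expands a "ListenFor" pattern into the plain phrases it covers. It only illustrates the syntax and is not how the phone's recognizer works internally; the names PhraseExpander and Expand are made up for this example, and the "{*}" wildcard is deliberately not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of the Windows Phone SDK.
public static class PhraseExpander
{
    // Expands "[optional]" words and "{label}" placeholders into plain phrases.
    // Assumes single-word tokens separated by spaces, as in the sample file.
    public static List<string> Expand(string pattern, IDictionary<string, string[]> phraseLists)
    {
        var results = new List<string> { "" };
        var tokens = pattern.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var token in tokens)
        {
            string[] choices;
            if (token.StartsWith("[") && token.EndsWith("]"))
            {
                // Optional word: either left out or spoken.
                choices = new[] { "", token.Substring(1, token.Length - 2) };
            }
            else if (token.StartsWith("{") && token.EndsWith("}"))
            {
                // Placeholder: any item of the phrase list with that label.
                choices = phraseLists[token.Substring(1, token.Length - 2)];
            }
            else
            {
                // Fixed word that must be spoken as written.
                choices = new[] { token };
            }

            // Combine every phrase built so far with every choice for this token.
            results = results
                .SelectMany(prefix => choices.Select(c => (prefix + " " + c).Trim()))
                .ToList();
        }

        return results;
    }
}
```

For the pattern "[and] poke {number} gorillas" and a "number" list of 1, 2, and 3, this yields six phrases, from "poke 1 gorillas" to "and poke 3 gorillas".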

Registering Voice Commands

Registration of voice commands must be performed from your code. The relevant namespace is Windows.Phone.Speech.VoiceCommands [3], which currently provides access to only two classes, VoiceCommandService [4] and VoiceCommandSet [5]. The former helps you register your voice command definition file and provides access to already installed (and named) voice command sets. Usually the installation only needs to be performed once; however, in certain situations the voice data of your app may be wiped, for example in backup/restore situations. It's no problem to perform the installation each time your application is started, but you can also play a bit nicer, for example like this:

// Install only if no named command set is registered yet (call this from an async method).
if (VoiceCommandService.InstalledCommandSets.Count == 0)
    await VoiceCommandService.InstallCommandSetsFromFileAsync(new Uri("ms-appx:///SampleVoiceCommands.xml"));

Note the URI argument that points to the definition file in your app. This is all that is needed for registration: the commands can be used right away, and your application and its commands will appear in the help UI of the voice recognition feature on the phone, as described above.

Working with Phrase Lists

One important thing to remember is that the definition of voice commands and phrases cannot be changed dynamically. The only parts that are accessible and can be changed from your code after you've installed the voice command definition file are the phrase lists. These however can be accessed comfortably and e.g. extended whenever you need it. For example, let's say you allow the user to launch your app and automatically display some custom data they created before. This is a typical scenario where static phrase lists won't work, as the potential values highly depend on the content users create. You can then use the following method to update the list whenever required, and instantly enable your voice commands to recognize the changed phrase lists.

VoiceCommandSet defaultCommandSet;
if (VoiceCommandService.InstalledCommandSets.TryGetValue("DefaultCommandSet", out defaultCommandSet))
{
    // Replace the items of the "number" phrase list with the new set of values.
    IEnumerable<string> newValues = new[] { "1", "2", "3", "4" };
    await defaultCommandSet.UpdatePhraseListAsync("number", newValues);
}

In the above snippet, I try to access the command set named "DefaultCommandSet" as it has been defined in the definition file before, and then update the "number" phrase list. In the original definition, this list only supported the values 1, 2, and 3, but after the update the voice recognition now also works for the value 4.
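For the user-generated content scenario mentioned above, the values usually come from data rather than a fixed array. The following is a minimal sketch of preparing such values before passing them to UpdatePhraseListAsync. The names PhraseListBuilder and FromTitles are hypothetical, and the cleanup rules (trimming, dropping empty entries and case-insensitive duplicates) are my own assumption of what is sensible, not a documented requirement of the platform.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of the Windows Phone SDK.
public static class PhraseListBuilder
{
    // Turns user-created titles into candidate phrase list values:
    // trimmed, without empty entries, and without case-insensitive duplicates.
    public static List<string> FromTitles(IEnumerable<string> titles)
    {
        return titles
            .Where(t => !string.IsNullOrWhiteSpace(t))
            .Select(t => t.Trim())
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

The result could then be handed to the command set in the same way as the fixed array in the snippet above, using whatever phrase list label your definition file declares.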

Processing Voice Commands

The last missing puzzle piece is how you actually process commands. After all, you want to access in particular the dynamic parts of a command the user has spoken in your app to determine what to do exactly. If you are a bit familiar with Windows Phone you probably are already thinking "navigation arguments", which is spot on. As in many other cases, these arguments are passed to the target page in its query string dictionary.

The following keys for this are well known:

  • voiceCommandName: contains the name of the command that was recognized
  • reco: contains the recognized part, e.g. the phrase defined in the corresponding "ListenFor" element
  • [phrase list name]: contains the value of the recognized phrase list entry

Let's look at a snippet for the gorilla example:

protected override void OnNavigatedTo(NavigationEventArgs e)
{
    var arguments = NavigationContext.QueryString;
    if (arguments.ContainsKey("voiceCommandName"))
    {
        var voiceCommand = arguments["voiceCommandName"];
        switch (voiceCommand)
        {
            case "PokeGorilla":
                // Default to a single gorilla when no "number" value was spoken.
                var numberOfGorillas = 1;
                if (arguments.ContainsKey("number"))
                {
                    var number = arguments["number"];
                    int.TryParse(number, out numberOfGorillas);
                }
                // do something with the command data
                break;
        }
    }
}

First I check for available command data. In a second step, I check whether the phrase list value "number" is available. If it's not, then the user has spoken the generic command for a single gorilla poke. Conversion of the "number" argument to a real integer should never fail, as the voice recognition system only passes on values defined by me. For example, even if you desperately try to say "10" or other invalid values for the "number" phrase list, the first value (1) will be returned here.
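One small caveat about the snippet above: int.TryParse sets its out parameter to 0 when parsing fails, so the default of 1 would be overwritten in that (unlikely) case. A slightly more defensive variant can be sketched as follows. The class and method names are hypothetical, and the helper works on any string dictionary, so it can be tried without a phone.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the Windows Phone SDK.
public static class VoiceCommandArguments
{
    // Returns the recognized command name, or null if the page
    // was not opened as the result of a voice command.
    public static string GetCommandName(IDictionary<string, string> query)
    {
        string name;
        return query.TryGetValue("voiceCommandName", out name) ? name : null;
    }

    // Reads a phrase list value as an integer. Falls back to the given
    // default if the key is missing or the value is not a valid number.
    public static int GetNumber(IDictionary<string, string> query, string label, int fallback)
    {
        string raw;
        int value;
        if (query.TryGetValue(label, out raw) && int.TryParse(raw, out value))
        {
            return value;
        }
        return fallback;
    }
}
```

Inside OnNavigatedTo, the query string dictionary of the navigation context could be passed in directly, for example with "number" as the label and 1 as the fallback.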


Even though there are probably quite a few users who don't use voice commands on a daily basis, or who will never even discover that feature on their phones, the effort it takes for a developer to add support for it is minimal. So if it makes sense for your app and you can think of some great shortcuts, I'm sure the power users aware of voice commands will love to see your ideas.


