Introduction

Voice control tags are Extensible Markup Language (XML) tags that customize SAPI (Speech API) text-to-speech output. Use the supported tags described below to make the audio in WIN-911 better fit your needs. For example, the <Silence> tag inserts a pause of a specified number of milliseconds, and the <Spell> tag makes the text-to-speech voice spell out text character by character. The next section, ‘Supported Voice Control Tags’, explains each supported tag in detail and gives examples that can be tried in WIN-911. The section after it, ‘Implementing Voice Control Tags’, explains how to apply the tags in WIN-911.

NOTE: Voice control tags are NOT compatible with WIN-911 ‘Runtime Voice Synthesis’. Wave files must be created using WIN-911’s ‘Text-to-Speech Wave Files’.

 

Supported Voice Control Tags

 

Volume:

The Volume tag controls the volume of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content. The Volume tag has one required attribute: Level. The value of this attribute should be an integer between zero and one hundred. Values outside of this range will be truncated.

<volume level="50">
This text should be spoken at volume level fifty.
</volume><volume level="100">
This text should be spoken at volume level one hundred.
</volume><volume level="40"/>
All text which follows should be spoken at volume level forty.

 

One hundred represents the default volume of a voice. Lower values represent percentages of this default. That is, 50 corresponds to 50% of full volume.

 

Rate:

The Rate tag controls the rate of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content. The Rate tag has two attributes, Speed and AbsSpeed, one of which must be present. The value of these attributes should be an integer between negative ten and ten. Values outside of this range may be truncated by the engine (but are not truncated by SAPI). The AbsSpeed attribute sets the absolute rate of the voice: a value of ten always produces rate ten and a value of five always produces rate five, regardless of any earlier rate setting.


<rate absspeed="5">
This text should be spoken at rate five.
</rate><rate absspeed="-5">
This text should be spoken at rate negative five.
</rate><rate absspeed="10"/>
All text which follows should be spoken at rate ten.

 

 

Speed:

The Speed attribute controls the relative rate of the voice. The absolute value is found by adding each Speed to the current absolute value. The value of this attribute should be an integer between negative twenty and ten.


<rate speed="5">
This text should be spoken at rate five.
<rate speed="-5">
This text should be spoken at rate zero.
</rate>
</rate>

 

Zero represents the default rate of a voice, with positive values being faster and negative values being slower.
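
The difference between the two attributes is easiest to see when tags are nested. The following is a sketch only, based on the rules described above: AbsSpeed sets the rate outright, Speed adds to the current rate, and a tag with content applies only to that content. The spoken sentences are made up for illustration, and how a given voice renders each rate may vary.

```xml
<rate absspeed="3">
This text should be spoken at rate three.
<rate speed="2">
This text should be spoken at rate five, because three plus two is five.
</rate>
This text should be spoken at rate three again.
</rate>
```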

 

Pitch:

The Pitch tag controls the pitch of a voice. The tag can be empty, in which case it applies to all subsequent text, or it can have content, in which case it only applies to that content. The Pitch tag has two attributes, Middle and AbsMiddle, one of which must be present. The value of both of these attributes should be an integer between negative ten and ten. Values outside of this range may be truncated by the engine (but are not truncated by SAPI). The AbsMiddle attribute sets the absolute pitch of the voice: a value of ten always produces pitch ten and a value of five always produces pitch five, regardless of any earlier pitch setting.


<pitch absmiddle="5">
This text should be spoken at pitch five.
</pitch><pitch absmiddle="-5">
This text should be spoken at pitch negative five.
</pitch><pitch absmiddle="10"/>
All text which follows should be spoken at pitch ten.

 

The Middle attribute controls the relative pitch of the voice. The absolute value is found by adding each Middle to the current absolute value.


<pitch middle="5">
This text should be spoken at pitch five.
<pitch middle="-5">
This text should be spoken at pitch zero.
</pitch>
</pitch>

 

Zero represents the default middle pitch for a voice, with positive values being higher and negative values being lower.

 

Emph:

The Emph tag instructs the voice to emphasize a word or section of text. The Emph tag cannot be empty.


The following word should be emphasized <emph>boo</emph>!

 

The method of emphasis may vary from voice to voice.

 

Spell:

The Spell tag forces the voice to spell out all text, rather than using its default word and sentence breaking rules, normalization rules, and so forth. All characters should be expanded to corresponding words (including punctuation, numbers, and so forth). The Spell tag cannot be empty.


<spell>
These words should be spelled out.
</spell>
These words should not be spelled out.
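
Spelling is often useful for identifiers that a voice would otherwise try to read as a word. As a hedged illustration, the following message asks the voice to read an equipment name one character at a time. The name P101 and the message wording are hypothetical and are not taken from any WIN-911 configuration.

```xml
Pump <spell>P101</spell> has failed.
```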

 

 

Silence:

The Silence tag inserts a specified number of milliseconds of silence into the output audio stream. This tag must be empty, and must have one attribute, Msec.


Five seconds of silence <silence msec="5000"/> just occurred.
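
Because Msec is given in milliseconds, multiply the number of seconds by 1000 to get the attribute value: half a second is 500 and two seconds is 2000. The message text in this small sketch is made up for illustration.

```xml
Tank level is high. <silence msec="500"/> Acknowledge the alarm. <silence msec="2000"/> This message will now repeat.
```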

 

 

Pron:

The Pron tag inserts a specified pronunciation. The voice will process the sequence of phonemes exactly as they are specified. This tag can be empty, or it can have content. If it does have content, it will be interpreted as providing the pronunciation for the enclosed text. That is, the enclosed text will not be processed as it normally would be. The Pron tag has one attribute, Sym, whose value is a string of whitespace-separated phonemes.


<pron sym="h eh l l ow & w er l l d"/>
<pron sym="h eh l l ow & w er l l d"> hello world </pron>

 

 

PartOfSp:

The PartOfSp tag provides the voice with the part of speech of the enclosed word(s). Use this tag to enable the voice to pronounce a word with multiple pronunciations correctly depending on its part of speech. The PartOfSp tag cannot be empty. The PartOfSp tag has one attribute, Part, which takes a string corresponding to a SAPI part of speech as its value. Only SAPI-defined parts of speech are supported: “Unknown”, “Noun”, “Verb”, “Modifier”, “Function”, and “Interjection”.


<partofsp part="noun"> A </partofsp> is the first letter of the alphabet.
Did you <partofsp part="verb"> record </partofsp> that <partofsp part="noun"> record </partofsp>?

 

Note: The PartOfSp tag is not supported by Cepstral’s Premium Voices Diane and David.

 

Context:

The Context tag provides the voice with information that it may use to determine how to normalize special items, like dates, numbers, and currency. Use this tag to enable the voice to distinguish between confusable data formats (see the example below). The Context tag cannot be empty. The Context tag has one attribute, Id, which takes a string corresponding to the context of the enclosed text. Several contexts are defined by SAPI and are more likely to be recognized by SAPI-compliant voices, but any string may be used.


<context id="date_mdy"> 03/04/01 </context> should be March fourth, two thousand one.
<context id="date_dmy"> 03/04/01 </context> should be April third, two thousand one.
<context id="date_ymd"> 03/04/01 </context> should be April first, two thousand three.

 

Note: The Context tag is not supported by Cepstral’s Premium Voices Diane and David.

 

Implementing Voice Control Tags

The following tutorial shows how to implement voice control tags in WIN-911 using the WIN-911 Configurator. In this example, we will add a 5 second pause in front of a text-to-speech message. All tags are applied in the same manner, but you will need to follow the syntax and guidelines given in the section above for each individual tag.

  1. To begin, open the WIN-911 Configurator.

  2. Click on the Define Common Sounds icon located under the Global Definitions row of the configurator.

  3. Next, click the button with the text that you would like to edit. In this example, ‘Press star to repeat the message any other key to continue’ is chosen.

  4. If you are using the Runtime Voice Synthesis option with WIN-911, skip to step 5; otherwise, continue with this step. On the Select Sound File window, click New. The Convert Text to Wave File window should then open.

  5. Enter the tag in the “Text to be converted:” textbox. In this case, the tag needed to create a 5 second pause before the message is <silence msec="5000"/>. You can use the Preview button to hear the results of the tag. Note: The tag sometimes behaves differently in the preview than it does when the text is used at runtime.

    Note: If you are using the Runtime Voice Synthesis option, the window is titled Save Runtime Text String instead.

  6. After you have entered the tag into the text, click OK. If you are using the Runtime Voice Synthesis option in WIN-911, skip to step 7; otherwise, continue with this step. You will be asked ‘A File Name must be entered do you want to use the Text for File Name?’ Choose Yes to do so, or No to enter a different file name. If the text is too long, you will be prompted to create a file name. The WIN-911 Configurator will then warn that ‘You are about to overwrite an existing *.WAV file. Do you want to continue?’ Choose Yes. Then click OK on the Select Sound File window, and again on the Common Sounds window. To complete the changes, save the configuration and, if Scan and Alarm is running, shut it down and restart it.

  7. Finish by clicking OK on the Common Sounds window. To complete the changes, save the configuration and, if Scan and Alarm is running, shut it down and restart it.
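
As a sketch of what the text box in step 5 might contain for this example, the Silence tag is placed directly in front of the message text. This assumes that the tag and the message are entered together in the same “Text to be converted:” box, which the steps above imply but do not state outright.

```xml
<silence msec="5000"/> Press star to repeat the message any other key to continue
```

Other tags would be entered the same way. For instance, wrapping the message in <rate absspeed="-2"> and </rate> should slow it down, subject to the syntax rules given for the Rate tag above.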


Reference: Microsoft Corporation. “MSDN - XML TTS Tutorial” website. Published on the internet, 2005 to present. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/SAPI51sr/Whitepapers/WP_XML_TTS_Tutorial.asp