SSML-enhanced input tags give you advanced control over voices generated by SecurityIQ’s Publishing Assistant. This allows you to add pauses, emphasize words and even add effects like whispering and breathing.

Here is a list of popular SSML tags and how you can include them in your next custom module. To avoid generating input errors, you must enclose SSML-enhanced text within a pair of <speak> tags.

Action SSML Tag
Identifying SSML-Enhanced Text <speak>
Adding a Pause <break>
Emphasizing Words <emphasis>
Specifying Another Language for Specific Words <lang>
Placing a Custom Tag in Your Text <mark>
Adding a Pause Between Paragraphs <p>
Using Phonetic Pronunciation <phoneme>
Controlling Volume, Speaking Rate, and Pitch <prosody>
Setting a Maximum Duration for Synthesized Speech <prosody amazon:max-duration>
Adding a Pause Between Sentences <s>
Controlling How Special Types of Words Are Spoken <say-as>
Pronouncing Acronyms and Abbreviations <sub>
Improving Pronunciation by Specifying Parts of Speech <w>
Adding the Sound of Breathing <amazon:auto-breaths>
Adding Dynamic Range Compression <amazon:effect name="drc">
Speaking Softly <amazon:effect phonation="soft">
Controlling Timbre <amazon:effect vocal-tract-length>
Whispering <amazon:effect name="whispered">

Identifying SSML-Enhanced Text

The <speak> tag is the root element of all Publishing Assistant SSML text. All SSML-enhanced text must be enclosed within a pair of <speak> tags.

Example:

<speak>Hackers try to steal your information.</speak>
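
As a rough illustration of the rule above, the following Python sketch wraps plain text in <speak> tags and checks that a string is well-formed XML with <speak> as its root. The helper names wrap_in_speak and has_speak_root are hypothetical and are not part of Publishing Assistant. The check uses Python's standard XML parser, so it only works for unprefixed tags; tags such as amazon:effect would need a namespace declaration before this parser accepts them.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_in_speak(text: str) -> str:
    """Escape reserved XML characters and enclose plain text in <speak> tags."""
    return f"<speak>{escape(text)}</speak>"


def has_speak_root(ssml: str) -> bool:
    """Return True if the string is well-formed XML whose root element is <speak>."""
    try:
        return ET.fromstring(ssml).tag == "speak"
    except ET.ParseError:
        return False


ssml = wrap_in_speak("Hackers try to steal your information & passwords.")
print(ssml)
print(has_speak_root(ssml))
print(has_speak_root("Hackers try to steal your information."))
```

Escaping matters because characters such as & and < are reserved in XML and would otherwise break the markup.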

Adding a Pause

A <break> tag adds a pause to your text. You can set the pause by strength (equivalent to the pause after a comma, a sentence or a paragraph), or you can set it to a specific length of time in seconds or milliseconds. If you don’t specify either attribute, Publishing Assistant uses the default, <break strength="medium"/>, which is the length of the pause after a comma.

Strength attribute values include:

  • none: No pause. Use none to remove a normally occurring pause, such as after a period
  • x-weak: Has the same strength as none, no pause
  • weak: Sets a pause of the same duration as the pause after a comma
  • medium: Has the same strength as weak
  • strong: Sets a pause of the same duration as the pause after a sentence
  • x-strong: Sets a pause of the same duration as the pause after a paragraph

Time attribute values include:

  • [number]s: The duration of the pause, in seconds. The maximum duration is 10s
  • [number]ms: The duration of the pause, in milliseconds. The maximum duration is 10000ms

Example:

<speak>

Phishing is a cyber attack <break time="3s"/>where hackers try to steal your information.

</speak>

If you don’t use an attribute with the break tag, the result varies depending on the text:

  • If there is no other punctuation next to the break tag, it creates a <break strength="medium"/> (comma-length pause).
  • If the tag is next to a comma, it upgrades the tag to a <break strength="strong"/> (sentence-length pause).
  • If the tag is next to a period, it upgrades the tag to a <break strength="x-strong"/> (paragraph-length pause).
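
The strength and time values described above can be generated programmatically. The sketch below is one possible approach in Python; the function names break_strength and break_time are hypothetical. It rejects values outside the ranges listed in this article (the six strength names and the 10-second maximum) instead of sending them on.

```python
MAX_BREAK_MS = 10_000  # this article lists 10s (10000ms) as the maximum pause

STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}


def break_strength(strength: str = "medium") -> str:
    """Build a <break> tag from one of the named strength values."""
    if strength not in STRENGTHS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'


def break_time(milliseconds: int) -> str:
    """Build a <break> tag with an explicit duration in milliseconds."""
    if not 0 <= milliseconds <= MAX_BREAK_MS:
        raise ValueError("pause must be between 0ms and 10000ms")
    return f'<break time="{milliseconds}ms"/>'


print(break_strength("strong"))
print(break_time(3000))
```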

Emphasizing Words

Emphasize words using the <emphasis> tag. Emphasizing words changes the speaking rate and volume. More emphasis makes Publishing Assistant speak the text louder and slower. Less emphasis makes it speak quieter and faster. To specify the degree of emphasis, use the level attribute.

The normal speaking rate and volume for a voice fall between the moderate and reduced levels.

Level attribute values include:

  • strong: Increases the volume and slows the speaking rate so that the speech is louder and slower.
  • moderate: Increases the volume and slows the speaking rate, but less than strong. Moderate is the default.
  • reduced: Decreases the volume and speeds up the speaking rate. Speech is softer and faster.

Example:

<speak>

Hackers <emphasis level="strong">really</emphasis> like it when you click before you think.

</speak>

Specifying Another Language for Specific Words

You can specify another language for a specific word, phrase or sentence with the <lang> tag. Foreign language words and phrases are generally spoken better when they are enclosed within a pair of <lang> tags. To specify the language, use the xml:lang attribute.

Unless you apply the <lang> tag, all of the words in the input text are spoken in the language of the voice specified in the voice-id. If you apply the <lang> tag, the words are spoken in that language.

For example, if you selected the voice-id Joanna (who speaks U.S. English), Publishing Assistant speaks the following in the Joanna voice without a French accent:

<speak>

Je ne parle pas français.

</speak>

If you use the Joanna voice with the <lang> tag, Publishing Assistant speaks the sentence in the Joanna voice in American-accented French:

<speak>

<lang xml:lang="fr-FR">Je ne parle pas français.</lang>

</speak>

Because Joanna is not a native French voice, pronunciation is based on her native language, U.S. English. For example, although perfect French pronunciation features a uvular trill /R/ in the word français, Joanna’s U.S. English voice pronounces this phoneme as the corresponding sound /r/.

If you use the voice-id of Giorgio, who speaks Italian, with the following text, Publishing Assistant speaks the sentence in Giorgio’s voice with an Italian pronunciation:

<speak>

Mi piace Lady Gaga.

</speak>

If you use the same voice with the following <lang> tag, Publishing Assistant pronounces Lady Gaga in Italian-accented English:

<speak>

Mi piace <lang xml:lang="en-US">Lady Gaga.</lang>

</speak>

This tag can also be used as a substitute for the optional DefaultLangCode option when synthesizing speech. However, doing so requires SSML text formatting.

Placing a Custom Tag in Your Text

Use the <mark> tag to put a custom tag within your text. Publishing Assistant takes no action on the tag, but returns the location of the tag in the SSML metadata. This tag can be anything you want to call out, as long as it maintains the following format:

<mark name="tag_name"/>

For example, if the tag name is “information” and the input text is:

<speak>

Hackers try to steal your <mark name="information"/>login credentials.

</speak>

Publishing Assistant might return the following SSML metadata: {"time":767,"type":"ssml","start":25,"end":46,"value":"information"}
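
If you need to act on this metadata, each record can be read as JSON. The following Python sketch assumes the metadata arrives as one JSON object per line, shaped like the record above; the function name mark_times is hypothetical, and the units of the time, start and end fields are not stated in this article.

```python
import json


def mark_times(metadata_lines, name):
    """Return the "time" value of every SSML mark whose value equals name."""
    times = []
    for line in metadata_lines:
        record = json.loads(line)
        if record.get("type") == "ssml" and record.get("value") == name:
            times.append(record["time"])
    return times


metadata = ['{"time":767,"type":"ssml","start":25,"end":46,"value":"information"}']
print(mark_times(metadata, "information"))  # prints [767]
```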

Adding a Pause Between Paragraphs

Use the <p> tag to add a pause between paragraphs in your text. Using this tag provides a longer pause than native speakers usually place at commas or the end of a sentence. This is equivalent to specifying a pause using <break strength="x-strong"/>.

Example:

<speak>

<p>This is the first paragraph. There should be a pause after this text is spoken.</p>

<p>This is the second paragraph.</p>

</speak>

Using Phonetic Pronunciation

Use the <phoneme> tag to use phonetic pronunciation for specific text. Two attributes are required with the <phoneme> tag. They indicate the phonetic alphabet used and the phonetic symbols of the corrected pronunciation:

alphabet:

  • ipa: Indicates that the International Phonetic Alphabet (IPA) will be used
  • x-sampa: Indicates that the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) will be used

ph:

Specifies the phonetic symbols for pronunciation.

With the <phoneme> tag, Publishing Assistant uses the pronunciation specified by the ph attribute instead of the standard pronunciation associated with your selected voice/language.

For instance, the word “pecan” can be pronounced two ways. In the following example, “pecan” is assigned a different pronunciation in each line. Publishing Assistant pronounces pecan as specified in the ph attributes, instead of using the default pronunciation:

<speak>

You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.

I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.

</speak>

Controlling Volume, Speaking Rate and Pitch

Use the <prosody> tag to control the volume, rate or pitch of your selected voice. The <prosody> tag has three attributes, each of which has several available values to set the attribute. Each attribute uses the same syntax:

<prosody attribute="value"></prosody>

volume

  • default: Resets volume to the default level for the current voice.
  • silent, x-soft, soft, medium, loud, x-loud: Sets the volume to a predefined value for the current voice.
  • +ndB, -ndB: Changes volume relative to the current level. A value of +0dB means no change, +6dB means approximately twice the current volume, and -6dB means approximately half the current volume.

Example 1:

<speak>

Sometimes, you need to say things <prosody volume="loud">louder.</prosody>

</speak>

Example 2:

<speak>

Other times, you might want to <prosody volume="-6dB">whisper.</prosody>

</speak>
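
To see why +6dB is described above as roughly twice the current volume and -6dB as roughly half, you can convert a decibel change to a ratio. This Python sketch assumes the common amplitude convention, in which the ratio is 10 raised to the power of dB/20; this article does not state which convention Publishing Assistant uses, and db_to_ratio is a hypothetical helper name.

```python
def db_to_ratio(db: float) -> float:
    """Convert a decibel change to an amplitude ratio (assumed 20*log10 convention)."""
    return 10 ** (db / 20)


for change in (6, 0, -6):
    print(f"{change:+d}dB -> x{db_to_ratio(change):.3f}")
```

Under this assumption, +6dB works out to a factor just under 2 and -6dB to a factor just over 0.5, which matches the approximate figures given in the volume list.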

rate

  • x-slow, slow, medium, fast, x-fast: Sets the rate to a predefined value for the selected voice.
  • n%: A non-negative percentage change in the speaking rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. This value has a range of 20-200%.

Example 1:

<speak>

Grab attention by <prosody rate="fast">speeding up the speaking rate of your text.</prosody>

</speak>

Example 2:

<speak>

Slow it down <prosody rate="85%">to give folks time to understand.</prosody>

</speak>
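
Because percentage rates are limited to the 20-200% range described above, it can help to check the value before building the tag. The Python sketch below is one way to do that; prosody_rate is a hypothetical helper, and raising an error for out-of-range values is a design choice of this sketch, not documented Publishing Assistant behavior.

```python
from xml.sax.saxutils import escape


def prosody_rate(text: str, percent: int) -> str:
    """Wrap text in a <prosody> tag with a percentage speaking rate."""
    if not 20 <= percent <= 200:
        raise ValueError("rate must be between 20% and 200%")
    return f'<prosody rate="{percent}%">{escape(text)}</prosody>'


print(prosody_rate("to give folks time to understand.", 85))
```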

pitch:

  • default: Resets pitch to the default level for the current voice
  • x-low, low, medium, high, x-high: Sets the pitch to a predefined value for the current voice
  • +n% or -n%: Adjusts pitch by a relative percentage. For example, a value of +0% means no baseline pitch change, +5% gives a little higher baseline pitch, and -5% results in a little lower baseline pitch

Example 1:

<speak>

Will your employees respond better to speech <prosody pitch="high">with a pitch that is higher than normal?</prosody>

</speak>

Example 2:

<speak>

Or will they prefer speech <prosody pitch="-10%">with a somewhat lower pitch?</prosody>

</speak>

The <prosody> tag must contain at least one attribute, but can include more within the same tag.

Example:

<speak>

Before you open any email, <prosody volume="loud" rate="x-slow">closely evaluate the sender and subject line.</prosody>

</speak>

You can also nest tags:

<speak>

<prosody rate="85%">Combining attributes <prosody pitch="-10%">can make this more effective</prosody> as well.</prosody>

</speak>

Setting a Maximum Duration for Synthesized Speech

Use the <prosody> tag with the amazon:max-duration attribute to control how long you want speech to take when synthesized.

The duration of synthesized speech varies slightly, depending on the voice you select. This can make it difficult to match synthesized speech with visuals or other activities that require precise timing. This issue is magnified for translation applications because the time it takes to say particular phrases can vary widely with different languages.

The <prosody amazon:max-duration> tag matches synthesized speech to the amount of time you want it to take (the duration).

Example:

<prosody amazon:max-duration="time duration">

You can specify duration in either seconds or milliseconds with the <prosody amazon:max-duration> tag:

  • ns: the maximum duration in seconds
  • nms: the maximum duration in milliseconds

Example:

<speak>

<prosody amazon:max-duration="2s">

Hackers use email to steal personal information.

</prosody>

</speak>

If the chosen voice or language would normally take longer than that duration, Publishing Assistant speeds up the speech so that it fits into the specified duration.

If the specified duration is longer than it takes to read the text at a normal rate, Publishing Assistant reads the speech normally. It doesn’t slow down the speech or add silence, so the resulting audio is shorter than requested.

Important: Publishing Assistant speeds up speech to no more than 5 times the normal rate, because text spoken any faster usually doesn’t make sense. If the speech cannot fit within your specified duration even at the maximum speed, the audio is sped up but lasts longer than the specified duration.
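
The three cases described above (text that already fits, text that fits once sped up, and text that cannot fit even at the maximum speed) can be expressed as a small calculation. This Python sketch only models the rules as stated in this article; resulting_duration is a hypothetical helper, and the natural duration of a passage is something you would have to measure yourself.

```python
MAX_SPEEDUP = 5.0  # this article states speech is sped up at most 5 times


def resulting_duration(natural_s: float, max_duration_s: float) -> float:
    """Estimate audio length in seconds given the natural length and the max-duration value."""
    if natural_s <= max_duration_s:
        # Speech is never slowed down or padded, so it stays at its natural length.
        return natural_s
    fastest = natural_s / MAX_SPEEDUP
    # Speech is sped up just enough to fit, but never beyond the 5x limit.
    return max(max_duration_s, fastest)


print(resulting_duration(1.5, 2.0))   # already fits
print(resulting_duration(3.0, 2.0))   # sped up to fit
print(resulting_duration(15.0, 2.0))  # cannot fit even at 5x
```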

You can include a single sentence or multiple sentences within a <prosody amazon:max-duration> tag, and you can use multiple <prosody amazon:max-duration> tags within your text.

Example:

<speak>

<prosody amazon:max-duration="2400ms">

Email can be overwhelming.

</prosody>

<break strength="strong"/>

<prosody amazon:max-duration="5100ms">

Making it easy for hackers to hide malicious content in friendly looking messages.

</prosody>

<break strength="strong"/>

<prosody amazon:max-duration="8900ms">

  That’s why we all need to slow down and make sure, regardless of how busy we are, that everyone understands the danger of email-based threats.

</prosody>

</speak>

Using the <prosody amazon:max-duration> tag can increase latency when Publishing Assistant returns synthesized speech. The degree of latency depends on the passage and its length. We recommend using relatively short text passages.

Pauses and max-duration

When using the max-duration tag, you can still insert pauses within your text. However, Publishing Assistant includes the length of each pause when calculating the maximum duration for speech. Additionally, Publishing Assistant preserves the short pauses that occur where commas and periods are placed within a passage and includes them in the maximum duration.

For example, in the following block, the 600 millisecond break and the breaks caused by the commas and periods occur within the 8-second speech:

<speak>

<prosody amazon:max-duration="8s">

Email is a powerful way to communicate.

<break time="600ms"/>

But it does come with risks, like phishing and malware.

</prosody>

</speak>  

Adding a Pause Between Sentences

Use the <s> tag to add a pause between lines or sentences in your text. Using this tag has the same effect as:

  • Ending a sentence with a period (.)
  • Specifying a pause with <break strength="strong"/>

In the following example, the <s> tag creates a short pause after both the first and second sentences. The final sentence has no <s> tag, but it is also followed by a short pause because it ends with a period.

<speak>

<s>Hackers target users with phishing</s>

<s>which puts your organization at risk</s>

Phishing simulations can help address this.

</speak>  

Controlling How Special Types of Words Are Spoken

Use the <say-as> tag with the interpret-as attribute to tell Publishing Assistant how to say certain characters, words and numbers. This enables you to provide additional context to eliminate any ambiguity on how Publishing Assistant should render the text.

The say-as tag uses one attribute, interpret-as, which accepts a number of values. Each uses the same syntax:

<say-as interpret-as="value">[text to be interpreted]</say-as>

The following values are available with interpret-as:

  • characters or spell-out: Spells out each letter of the text, as in a-b-c.
  • cardinal or number: Interprets the numerical text as a cardinal number, as in 1,234.
  • ordinal: Interprets the numerical text as an ordinal number, as in 1,234th.
  • digits: Spells out each digit individually, as in 1-2-3-4.
  • fraction: Interprets the numerical text as a fraction. This works for both common fractions, such as 3/20, and mixed fractions, such as 2+1/2. See below for more information.
  • unit: Interprets a numerical text as a measurement. The value should be either a number or a fraction followed by a unit with no space in between as in 1/2inch, or by just a unit, as in 1meter.
  • date: Interprets the text as a date. The format of the date must be specified with the format attribute. See below for more information.
  • time: Interprets the numerical text as a duration, in minutes and seconds, as in 1'21".
  • address: Interprets the text as part of a street address.
  • expletive: “Beeps out” the content included within the tag.
  • telephone: Interprets the numerical text as a 7-digit or 10-digit telephone number, as in 2025551212. You can also use this value to handle telephone extensions, as in 2025551212x345. See below for more information.

Note: Currently the telephone option is only available for English language voices.

Fractions

Publishing Assistant interprets values within the say-as tag that have the interpret-as="fraction" attribute as common fractions. The following is the syntax for fractions:

  • Fraction syntax: cardinal number/cardinal number, such as 2/9.
    • Example: <say-as interpret-as="fraction">2/9</say-as> is pronounced “two ninths.”
  • Non-negative mixed number syntax: cardinal number+cardinal number/cardinal number, such as 3+1/2.
    • Example: <say-as interpret-as="fraction">3+1/2</say-as> is pronounced “three and a half.”

Note: There must be a “+” between the “3” and the “1/2”. Publishing Assistant doesn’t support a mixed number without the “+”, such as “3 1/2”.
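
The fraction syntax above, including the required “+” in mixed numbers, can be produced with a small helper. This Python sketch is illustrative only; fraction_ssml is a hypothetical name.

```python
def fraction_ssml(numerator: int, denominator: int, whole: int = 0) -> str:
    """Build a say-as fraction tag, joining any whole part with the required '+'."""
    if numerator < 0 or denominator <= 0 or whole < 0:
        raise ValueError("only non-negative fractions are handled in this sketch")
    body = f"{numerator}/{denominator}"
    if whole:
        body = f"{whole}+{body}"
    return f'<say-as interpret-as="fraction">{body}</say-as>'


print(fraction_ssml(2, 9))
print(fraction_ssml(1, 2, whole=3))
```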

Dates

When interpret-as is set to date, you also need to indicate the format of the date:

<say-as interpret-as="date" format="format">[date]</say-as>

Example:

<speak>

I was born on <say-as interpret-as="date" format="mdy">12-31-1900</say-as>.

</speak>

The following formats can be used with the format attribute.

  • mdy: Month-day-year.
  • dmy: Day-month-year.
  • ymd: Year-month-day.
  • md: Month-day.
  • dm: Day-month.
  • ym: Year-month.
  • my: Month-year.
  • d: Day.
  • m: Month.
  • y: Year.
  • yyyymmdd: Year-month-day. If you use this format, you can make Publishing Assistant skip parts of the date using question marks. This example will be rendered as “September 22nd”:

<say-as interpret-as="date">????0922</say-as>
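
If your dates come from code rather than hand-typed text, you can format them to match the patterns above. The following Python sketch covers two cases: the mdy format, and a month-and-day value that hides the year with question marks. The function names are hypothetical, and the second tag mirrors the example above, which does not set a format attribute.

```python
from datetime import date


def say_as_date_mdy(d: date) -> str:
    """Format a date as month-day-year inside a say-as tag."""
    value = d.strftime("%m-%d-%Y")
    return f'<say-as interpret-as="date" format="mdy">{value}</say-as>'


def say_as_month_day(d: date) -> str:
    """Format only the month and day, replacing the year with question marks."""
    value = "????" + d.strftime("%m%d")
    return f'<say-as interpret-as="date">{value}</say-as>'


sample = date(2024, 9, 22)
print(say_as_date_mdy(sample))
print(say_as_month_day(sample))
```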

Telephone

Publishing Assistant attempts to interpret your text based on the text’s formatting even without the <say-as> tag. For example, if your text includes “202-555-1212,” it’s interpreted as a 10-digit telephone number and says each digit individually, with a brief pause for each dash. In this case, you don’t need to use <say-as interpret-as="telephone">. However, if you provide the text “2025551212” and want Publishing Assistant to say it as a phone number, you would specify <say-as interpret-as="telephone">.

The logic for interpreting each element is language-specific. For example, U.S. and UK English differ in how phone numbers are pronounced (in UK English, sequences of the same digit are grouped together, as in “double five” or “triple four”). To see the difference, test the following example with a U.S. voice and with a UK voice:

<speak>

Your number is <say-as interpret-as="telephone">2122241555</say-as>

</speak>
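
When phone numbers reach you in mixed formats, one option is to reduce them to digits and always apply the telephone value. The Python sketch below does that for 7-digit and 10-digit numbers only; telephone_ssml is a hypothetical helper, and extensions are deliberately not handled.

```python
import re


def telephone_ssml(raw: str) -> str:
    """Strip formatting from a phone number and wrap the digits in a say-as tag."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) not in (7, 10):
        raise ValueError("expected a 7-digit or 10-digit telephone number")
    return f'<say-as interpret-as="telephone">{digits}</say-as>'


print(telephone_ssml("(212) 224-1555"))
```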

Pronouncing Acronyms and Abbreviations

Use the <sub> tag with the alias attribute to substitute a different word (or pronunciation) for selected text such as an acronym or abbreviation. This uses the syntax:

<sub alias="new word">abbreviation</sub>

In this example, the name “Mercury” is substituted for the element’s chemical symbol to make the audio content clearer.

<speak>

I don’t like the chemical element <sub alias="Mercury">Hg</sub>, because it’s shiny.

</speak>
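
If a module uses the same abbreviations repeatedly, you could apply the <sub> tag from a lookup table instead of tagging each occurrence by hand. The Python sketch below shows one way to do this; sub_tag and apply_aliases are hypothetical helpers, and the sketch assumes the input is plain text that does not already contain SSML tags or reserved XML characters.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def sub_tag(abbreviation: str, alias: str) -> str:
    """Build a <sub> tag that speaks alias in place of abbreviation."""
    return f"<sub alias={quoteattr(alias)}>{escape(abbreviation)}</sub>"


def apply_aliases(text: str, aliases: dict) -> str:
    """Wrap every whole-word occurrence of a known abbreviation in a <sub> tag."""
    if not aliases:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, aliases)) + r")\b")
    return pattern.sub(lambda m: sub_tag(m.group(1), aliases[m.group(1)]), text)


print(apply_aliases("The chemical element Hg is shiny.", {"Hg": "Mercury"}))
```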

Improving Pronunciation by Specifying Parts of Speech

Use the <w> tag to customize the pronunciation of words by specifying the word’s part of speech or alternate meaning. This is done using the role attribute in the following syntax:

<w role="attribute">text</w>

Role attribute values for part of speech include:

  • amazon:VB: interprets the word as a verb (present simple)
  • amazon:VBD: interprets the word as past tense or as a past participle

For example, the U.S. English pronunciation of the word “read” depends on its part of speech, which you can set with the tag:

<speak>

The word <say-as interpret-as="characters">read</say-as> may be interpreted as either the present simple form <w role="amazon:VB">read</w>, or the past participle form <w role="amazon:VBD">read</w>.

</speak>

To specify an alternate meaning:

  • amazon:SENSE_1: uses the non-default sense of the word when present. For example, the noun “bass” is pronounced differently depending on its meaning. The default meaning is the lowest part of the musical range. The alternate meaning is a species of freshwater fish, also spelled “bass” but pronounced differently. Using <w role="amazon:SENSE_1">bass</w> renders the non-default pronunciation (freshwater fish) for the audio text.

Example:

<speak>

Depending on your meaning, the word <say-as interpret-as="characters">bass</say-as> may be interpreted as either a musical element: bass, or as its alternative meaning, a freshwater fish <w role="amazon:SENSE_1">bass</w>.

</speak>

Note: Some languages may have a different selection of supported parts of speech.

Adding the Sound of Breathing

Add breathing sounds to synthesized speech to make it sound more natural. The <amazon:breath> and <amazon:auto-breaths> tags provide breaths. You have the following options:

  • Manual mode: you set the location, length, and volume of a breath sound within the text
  • Automated mode: Publishing Assistant automatically inserts breathing sounds into the speech output
  • Mixed mode: both you and Publishing Assistant add breathing sounds

Manual Mode

In manual mode, you place the <amazon:breath/> tag in the input text where you want to locate a breath. You can customize the length and volume of breaths with the duration and volume attributes, respectively:

  • duration: Controls the length of the breath. Valid values are: default, x-short, short, medium, long, x-long. The default value is medium.
  • volume: Controls how loud breathing sounds. Valid values are: default, x-soft, soft, medium, loud, x-loud. The default value is medium.

Note: The exact length and volume of each attribute value is dependent on the specific voice used.

To set a breath sound using the defaults, use <amazon:breath/> without attributes.

Example with duration and volume attributes:

<speak>

Sometimes you want to insert only <amazon:breath duration="medium" volume="x-loud"/>a single breath.

</speak>

Example using default settings:

<speak>

Sometimes you need <amazon:breath/>to insert one or more average breaths <amazon:breath/> so that the text sounds correct.

</speak>

Example using individual breathing sounds within a passage:

<speak>

<amazon:breath duration="long" volume="x-loud"/> <prosody rate="120%"> <prosody volume="loud">

Wow! <amazon:breath duration="long" volume="loud"/> </prosody> That was a good phish <amazon:breath duration="medium" volume="x-loud"/>. I’ve never seen one so well written. </prosody>

</speak>

Automated Mode

In automated mode, you use the <amazon:auto-breaths> tag to tell Publishing Assistant to automatically create breathing noises at appropriate intervals. You can set the frequency of the intervals, their volume and their duration. Place the <amazon:auto-breaths> tag at the beginning of the text that you want to apply automated breathing to, and then close the tag at the end.

Note: Unlike the manual mode tag, <amazon:breath/>, the <amazon:auto-breaths> tag requires a closing tag (</amazon:auto-breaths>).

You can use the following optional attributes with the <amazon:auto-breaths> tag:

  • volume: Controls how loud the breathing sounds. Valid values are: default, x-soft, soft, medium, loud, x-loud. The default value is medium.
  • frequency: Controls how often breathing sounds occur in the text. Valid values are: default, x-low, low, medium, high, x-high. The default value is medium.
  • duration: Controls the length of the breath. Valid values are: default, x-short, short, medium, long, x-long. The default value is medium.

By default, the frequency of breathing sounds depends on the input text. However, breathing sounds often occur after commas and periods.

Example using automated mode without optional parameters:

<speak>

<amazon:auto-breaths>SecurityIQ helps you train users how to stay safe online. </amazon:auto-breaths>

</speak>

Example using automated mode with volume control. The unspecified parameters (duration and frequency) are set to the default values (medium):

<speak>

<amazon:auto-breaths volume="x-soft">SecurityIQ helps you train users how to stay safe online. </amazon:auto-breaths>

</speak>

Example using automated mode with frequency control. The unspecified parameters (duration and volume) are set to the default values (medium):

<speak>

<amazon:auto-breaths frequency="x-low">SecurityIQ helps you train users how to stay safe online.</amazon:auto-breaths>

</speak>

Example using automated mode with multiple parameters. For the unspecified duration parameter, Publishing Assistant uses the default value (medium):

<speak>

<amazon:auto-breaths volume="x-loud" frequency="x-low">SecurityIQ helps you train users how to stay safe online. </amazon:auto-breaths>

</speak>

Adding Dynamic Range Compression

To enhance the volume of certain sounds in your audio file, use the dynamic range compression (drc) tag.

The drc tag sets a midrange “loudness” threshold for your audio, and increases the volume (the gain) of the sounds around that threshold. It applies the greatest gain increase closest to the threshold, and the gain increase is lessened farther away from the threshold.

This makes the middle-range sounds easier to hear in a noisy environment, which makes the entire audio file clearer.

The drc tag is a Boolean parameter (it’s either present or it isn’t). It uses the syntax:

<amazon:effect name="drc"> and is closed with </amazon:effect>.

You can use the drc tag with any voice or language supported by Publishing Assistant. You can apply it to an entire section of the recording, or for only a few words. Example:

<speak>

Some audio is difficult to hear in an office, but <amazon:effect name="drc"> this audio is less difficult to hear in an office.</amazon:effect>

</speak>

Note: When you use “drc” in the amazon:effect syntax, it is case-sensitive.

Using drc With the prosody Volume Tag

To further increase the volume of certain parts of the file, use the drc tag with the prosody volume tag. Combining tags doesn’t affect the settings of the prosody volume tag.

When you use the drc and prosody volume tags together, Publishing Assistant applies the drc tag first, increasing the middle-range sounds (those near the threshold). It then applies the prosody volume tag and further increases the volume of the entire audio track evenly. To use the tags together, nest one inside the other. Example:

<speak>

<prosody volume="loud">This text needs to be loud and easy to understand. <amazon:effect name="drc">This text also needs to be more understandable in an office.</amazon:effect></prosody>

</speak>

In this text, the prosody volume tag increases the volume of the entire passage to “loud.” The drc tag enhances the volume of the middle-range values in the second sentence.

Note: When using the drc and prosody volume tags together, use standard XML practices for nesting tags.

Speaking Softly

Use the <amazon:effect phonation="soft"> tag to generate softly spoken words. This uses the syntax:

<amazon:effect phonation="soft">text</amazon:effect>

Example using the Matthew voice:

<speak>

    This is Matthew speaking in my normal voice. <amazon:effect phonation="soft">This

    is Matthew speaking in my softer voice.</amazon:effect>

</speak>

Controlling Timbre

Use the vocal-tract-length tag to control the timbre of output speech. This tag changes the length of the speaker’s vocal tract, which sounds like a change in the speaker’s size. When you increase the vocal-tract-length, the speaker sounds physically bigger. When you decrease it, the speaker sounds smaller.

To change timbre, use the following values:

  • +n% or -n%: Adjusts the vocal tract length by a relative percentage change in the current voice. For example, +4% or -2%. Valid values range from +100% to -50%. Values outside this range are clipped. For example, +111% sounds like +100% and -60% sounds like -50%.
  • n%: Changes the vocal tract length to an absolute percentage of the tract length of the current voice. For example, 110% or 75%. An absolute value of 110% is equivalent to a relative value of +10%. An absolute value of 100% is the same as the default value for the current voice.

Example:

<speak>

This is my original voice, without any modifications. <amazon:effect vocal-tract-length="+15%"> Now, imagine that I am much bigger. </amazon:effect> <amazon:effect vocal-tract-length="-15%"> Or, much smaller. </amazon:effect> You can also control the timbre of my voice by making minor adjustments. <amazon:effect vocal-tract-length="+10%"> For example, by making me sound bigger. </amazon:effect><amazon:effect vocal-tract-length="-10%"> Or, making me sound only somewhat smaller. </amazon:effect>

</speak>
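
The relationship between absolute and relative vocal-tract-length values, and the clipping range described above, can be worked through in a few lines. This Python sketch converts an absolute percentage to the equivalent relative value and clips it to the stated range of -50% to +100%; relative_vtl is a hypothetical helper name.

```python
def relative_vtl(absolute_percent: int) -> str:
    """Convert an absolute vocal-tract-length percentage to a clipped relative value."""
    relative = absolute_percent - 100           # 110% absolute equals +10% relative
    clipped = max(-50, min(100, relative))      # values outside -50..+100 are clipped
    return f"{clipped:+d}%"


for absolute in (110, 75, 211, 40):
    print(absolute, "->", relative_vtl(absolute))
```

The clipping step only models the effect described in the list above, where out-of-range values sound the same as the nearest limit.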

Combining Multiple Tags

You can combine the vocal-tract-length tag with any other SSML tag that is supported by Publishing Assistant. Because timbre (vocal tract length) and pitch are closely connected, you might get the best results by using both the vocal-tract-length and the <prosody pitch> tags. To produce the most realistic voice, we recommend that you use different percentages of change for the two tags. Experiment with various combinations to get the results you want.

Example:

<speak>

The pitch and timbre of a person’s voice are connected in human speech. <amazon:effect vocal-tract-length="-15%"> If you are going to reduce the vocal tract length, </amazon:effect><amazon:effect vocal-tract-length="-15%"> <prosody pitch="+20%"> you might consider increasing the pitch, too. </prosody></amazon:effect> <amazon:effect vocal-tract-length="+15%"> If you choose to lengthen the vocal tract, </amazon:effect> <amazon:effect vocal-tract-length="+15%"> <prosody pitch="-10%"> you might also want to lower the pitch. </prosody></amazon:effect>

</speak>

Whispering

The whispered effect indicates that the input text should be spoken in a whispered voice rather than as normal speech. It uses the following syntax:

<amazon:effect name="whispered">text</amazon:effect>

Example:

<speak>

   <amazon:effect name="whispered">Hackers are sneaky, </amazon:effect>

    she said, <amazon:effect name="whispered">they try to trick you.</amazon:effect>

</speak>

In this case, the synthesized speech spoken by the character is whispered, but the phrase “she said” is spoken in the normal synthesized speech of the selected Publishing Assistant voice.

Note: When generating speech marks for a whispered voice, the audio stream must also include the whispered voice to ensure that the speech marks match the audio stream.

Source: https://docs.aws.amazon.com/polly/latest/dg/supported-ssml.html#speak-tag