   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
   2 <HTML>
   3 <HEAD>
   4         <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
   5         <TITLE>FreeTTS Programmer's Guide</TITLE>
   6         <META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
   7         <META NAME="CREATED" CONTENT="20011115;9153400">
   8         <META NAME="CHANGED" CONTENT="20011116;15171800">
   9         <!-- /**
  10  * Copyright (c) 2001 Sun Microsystems, Inc.
  11  *
  12  * See the file "license.terms" for information on usage and
  13  * redistribution of this file, and for a DISCLAIMER OF ALL
  14  * WARRANTIES.
  15  */
  16
  17 -->
  18 </HEAD>
  19 <BODY BGCOLOR="#ffffff">
  20 <CENTER>
  21         <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2 BGCOLOR="#ccccee" STYLE="page-break-before: always">
  22                 <TR>
  23                         <TD WIDTH=100%>
  24                                 <H1 ALIGN=CENTER>FreeTTS Programmer's Guide</H1>
  25                         </TD>
  26                 </TR>
  27         </TABLE>
  28 </CENTER>
  29 <TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=2>
  30         <TR>
  31                 <TD WIDTH=25% VALIGN=TOP BGCOLOR="#eeeeff">
  32                         <P><BR><B>Table of Contents </B><BR><A HREF="#organization">FreeTTS
  33                         Organization </A><BR><A HREF="#objects">Major FreeTTS Objects
  34                         </A><BR><A HREF="#processing">Processing Walkthrough</A> <BR><A HREF="#data">FreeTTS
  35                         Data</A> <BR><A HREF="#code">Code Walkthrough </A>
  36                         <BR><A HREF="#packaging">Voice Packaging</A>
  37                         </P>
  38                         <P><B>Related Documentation </B><BR><A HREF="index.html">FreeTTS
  39                         Overview </A><BR><A HREF="../javadoc/index.html">FreeTTS API </A><BR><A HREF="http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-doc/index.html">The
  40                         Java Speech API (JSAPI) </A><BR><A HREF="http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-guide/index.html">JSAPI
  41                         Programmer's Guide </A><BR><A HREF="http://www.cmuflite.org/">Flite
  42                         </A><BR><A HREF="http://www.speech.cs.cmu.edu/festival/index.html">Festival
  43                         </A>
  44                         </P>
  45                 </TD>
  46                 <TD WIDTH=5%></TD>
  47                 <TD>
  48                         <P><I>What this is </I>- This is an overview of how FreeTTS works
  49                         from a programmer's point of view. It describes the major classes
  50                         and objects used in FreeTTS, provides a data-flow walkthrough of
  51                         FreeTTS as it synthesizes speech, and provides an annotated
  52                         definition of a voice that  serves as an example of how to define
  53                         a new custom voice.
  54                         </P>
  55                         <P><I>What this is not</I> - This is not an API guide to FreeTTS,
  56                         nor is it a tutorial on the fundamentals of speech synthesis.
  57                         </P>
  58                         <P>The FreeTTS package is based upon Flite, a light-weight
  59                         synthesis package developed at CMU. FreeTTS retains the core
  60                         architecture of Flite. Anyone who is familiar with the workings of
  61                         Flite will probably feel comfortable working with FreeTTS. Since
  62                         Flite itself is based upon the Festival speech synthesis system,
  63                         those who are experienced with the Festival package will notice
  64                         some similarities between Festival and FreeTTS.
  65                         </P>
  66                 </TD>
  67         </TR>
  68 </TABLE>
  69 <HR>
  70 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
  71         <TR>
  72                 <TD BGCOLOR="#eeeeff">
  73                         <H2><A NAME="organization"></A>FreeTTS Organization
  74                         </H2>
  75                 </TD>
  76         </TR>
  77 </TABLE>
  78 <P>FreeTTS is organized as a number of trees as follows:
  79 </P>
  80 <UL>
  81
  82 <!--
  83         <LI><P STYLE="margin-bottom: 0cm"><B>javax.speech </B>contains the
  84         generic JSAPI speech implementation. Code in this tree is
  85         independent of any speech synthesis system.
  86         </P>
  87 -->
  88         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.engine </B>contains
  89         support for JSAPI 1.0.   Various packages
  90         can be found in this tree.
  91         </P>
  92         <LI><P><B>com.sun.speech.freetts </B>contains the implementation of
  93         the FreeTTS synthesis engine. The bulk of the code can be found in
  94         this tree.
  95         The <B>com.sun.speech.freetts.jsapi</B> package provides the
  96         JSAPI glue code for FreeTTS.
  97         </P>
  98 </UL>
  99 <P>The <B>com.sun.speech.freetts</B> package is broken down further
 100 into sets of sub-packages as follows:
 101 </P>
 102 <UL>
 103         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts </B>contains
 104         high-level interfaces and classes for FreeTTS. Much of the code that
 105         does not depend on a particular language or voice can be found here.
 106         </P>
 107         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.diphone
 108         </B>provides support for diphone encoded speech.
 109         </P>
 110         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.clunits
 111         </B>provides support for cluster-unit encoded speech.
 112         </P>
 113         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.lexicon
 114         </B>provides definition and implementation of the Lexicon and
 115         Letter-to-Sound Rules.
 116         </P>
 117         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.util
 118         </B>provides a set of tools and utilities.
 119         </P>
 120         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.audio
 121         </B>provides audio output support.
 122         </P>
 123         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.cart
 124         </B>provides the interface and
 125         implementations of several Classification and Regression Trees
 126         (CART).
 127         </P>
 128         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.relp
 129         </B>provides support for Residual Excited Linear Predictive (RELP)
 130         decoding of audio samples.
 131         </P>
 132         <LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.en
 133         </B>contains English-specific
 134         code.</P>
 135         <LI><P><B>com.sun.speech.freetts.en.us </B>contains
 136         US-English-specific code.</P>
 137 </UL>
 138 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
 139         <TR>
 140                 <TD BGCOLOR="#eeeeff">
 141                         <H2><A NAME="objects"></A>Major FreeTTS Objects
 142                         </H2>
 143                 </TD>
 144         </TR>
 145 </TABLE>
 146 <P>There are a number of objects that work together to perform speech
 147 synthesis.
 148 </P>
 149 <H3><A HREF="../javadoc/com/sun/speech/freetts/FreeTTSSpeakable.html">FreeTTSSpeakable</A></H3>
 150 <P>FreeTTSSpeakable is an interface. Anything that is a source of text
 151 that needs to be spoken with FreeTTS is first converted into a
 152 FreeTTSSpeakable. One implementation of this interface is
 153 FreeTTSSpeakableImpl. This implementation will wrap the most common
 154 input forms (a String, an InputStream, or a JSML XML document) as a
 155 FreeTTSSpeakable. A FreeTTSSpeakable is given to a Voice to be spoken.
 156 </P>
 157 <H3><A HREF="../javadoc/com/sun/speech/freetts/Voice.html">Voice</A></H3>
 158 <P>The Voice is the central processing point for FreeTTS. The Voice
 159 takes as input a FreeTTSSpeakable, translates the text associated with
 160 the FreeTTSSpeakable into speech and generates audio output
 161 corresponding to that speech. The Voice is the primary customization
 162 point for FreeTTS. Language, speaker, and algorithm customizations can
 163 all be performed by extending the Voice. A Voice will accept a
 164 FreeTTSSpeakable via the Voice.speak method and process it as follows:
 165 </P>
 166 <UL>
 167         <LI><P STYLE="margin-bottom: 0cm">The Voice converts a
 168         FreeTTSSpeakable into a series of Utterances. The rules for breaking
 169         a FreeTTSSpeakable into Utterances are generally language dependent.
 170         For instance, an English Voice may choose to break a FreeTTSSpeakable
 171         into Utterances based upon sentence breaks.
 172         </P>
 173         <LI><P STYLE="margin-bottom: 0cm">As the Voice generates  each
 174         Utterance, a series of  UtteranceProcessors processes the Utterance.
 175         Each Voice defines its own set of UtteranceProcessors. This is the
 176         primary method of customizing Voice behavior. For instance, to
 177         change how units are joined together during the synthesis process, a
 178         Voice would simply supply a new UtteranceProcessor that implements
 179         the new algorithm. Typically each UtteranceProcessor will run in
 180         turn, annotating or modifying the Utterance with information. For
 181         instance, a 'Phrasing' UtteranceProcessor may insert phrase marks
 182         into an Utterance that indicate where a spoken phrase begins. The
 183         Utterance and UtteranceProcessors are described in more detail
 184         below.
 185         </P>
 186         <LI><P>Once all Utterance processing has been applied, the Voice
 187         sends the Utterance to the AudioOutput UtteranceProcessor. The
 188         AudioOutput processor may run in a separate thread to allow
 189         Utterance processing to overlap with audio output, ensuring the
 190         lowest sound latency possible.
 191         </P>
 192 </UL>
 193 <H3><A HREF="../javadoc/com/sun/speech/freetts/VoiceManager.html">VoiceManager</A></H3>
 194 <P>The VoiceManager is the central repository of voices available to
 195 FreeTTS.  To get a voice you can do:
 196 <pre>
 197 VoiceManager voiceManager = VoiceManager.getInstance();
 198
 199 // create a list of new Voice instances
 200 Voice[] voices = voiceManager.getVoices();
 201
 202 // iterate through the list until you find a Voice with the properties
 203 // you want
 204 ...
 205
 206 // allocate the resources for the voice
 207 voices[x].allocate();
 208
 209 </pre>
 210 You can save yourself the chore of iterating through the voices if you
 211 already know the name of the Voice you want by using voiceManager.getVoice().
 212 </P>
 213 <H3><A HREF="../javadoc/com/sun/speech/freetts/Utterance.html">Utterance</A></H3>
 214 <P>The Utterance is the central processing target in FreeTTS. A
 215 FreeTTSSpeakable is broken up into one or more Utterances, processed
 216 by a series of UtteranceProcessors, and finally output as audio. An
 217 Utterance consists of a set of Relations and a set of features called
 218 FeatureSets.
 219 </P>
 220 <H3><A HREF="../javadoc/com/sun/speech/freetts/FeatureSet.html">FeatureSet</A></H3>
 221 <P>A FeatureSet is simply a Name/Value pair. An Utterance can contain
 222 an arbitrary number of FeatureSets. FeatureSets are typically used to
 223 maintain global Utterance information such as volume, pitch and
 224 speaking rate.
 225 </P>
 226 <H3><A HREF="../javadoc/com/sun/speech/freetts/Relation.html">Relation</A></H3>
 227 <P>A Relation is a named list of Items. An Utterance can hold an
 228 arbitrary number of Relations. A typical UtteranceProcessor may
 229 iterate through one Relation and  create a new Relation. For
 230 instance, a word normalization UtteranceProcessor could iterate
 231 through a token Relation and generate a word Relation based upon
 232 token-to-word rules. Utterance processing, and how it affects the
 233 Relations in an Utterance, is described in detail
 234 below.
 235 </P>
 236 <H3><A HREF="../javadoc/com/sun/speech/freetts/Item.html">Item</A></H3>
 237 <P>A Relation is a list of Item objects. An Item contains a set of
 238 Features (as described previously, FeatureSets are merely name/value
 239 pairs). An Item can have a list of daughter Items as well. Items in a
 240 Relation are linked to Items in the same and other Relations. For
 241 instance, the words in a word Relation are linked back to the
 242 corresponding tokens in the token Relation. Similarly, a word in a
 243 word Relation is linked to the previous and next words in the word
 244 Relation. This gives an UtteranceProcessor the capability of easily
 245 traversing from one Item to another.
 246 </P>
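<P>The links between Items can be pictured with a small sketch. The
classes below (<FONT FACE="Courier, sans-serif">ToyItem</FONT>,
<FONT FACE="Courier, sans-serif">RelationSketch</FONT>) are hypothetical
stand-ins written for this illustration and are much simpler than the
real Item and Relation classes; they only show the idea of features,
previous/next links within a Relation, and daughter links between
Relations.
</P>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration; these are NOT the FreeTTS classes.
final class RelationSketch {

    /** A toy Item: name/value features plus links to neighbouring Items. */
    static final class ToyItem {
        final Map<String, Object> features = new HashMap<>();
        final List<ToyItem> daughters = new ArrayList<>();
        ToyItem previous; // previous Item in the same relation
        ToyItem next;     // next Item in the same relation
        ToyItem parent;   // owning Item in another relation, if any

        ToyItem(String name) {
            features.put("name", name);
        }

        String name() {
            return (String) features.get("name");
        }
    }

    /** Appends a new Item to a relation and links it to its predecessor. */
    static ToyItem append(List<ToyItem> relation, String name) {
        ToyItem item = new ToyItem(name);
        if (!relation.isEmpty()) {
            ToyItem last = relation.get(relation.size() - 1);
            last.next = item;
            item.previous = last;
        }
        relation.add(item);
        return item;
    }

    /** Records that a word Item was generated from a token Item. */
    static void addDaughter(ToyItem token, ToyItem word) {
        token.daughters.add(word);
        word.parent = token;
    }

    public static void main(String[] args) {
        List<ToyItem> tokens = new ArrayList<>();
        List<ToyItem> words = new ArrayList<>();
        ToyItem token = append(tokens, "2001");
        for (String w : new String[] {"two", "thousand", "one"}) {
            addDaughter(token, append(words, w));
        }
        ToyItem thousand = words.get(1);
        System.out.println(thousand.previous.name() + " <- " + thousand.name()
                + " -> " + thousand.next.name() + ", token " + thousand.parent.name());
    }
}
```

<P>From any word, a processor can reach its neighbours in the word
list or step back to the token that produced it, which is the kind of
traversal described above.
</P>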
 247 <H3><A HREF="../javadoc/com/sun/speech/freetts/UtteranceProcessor.html">UtteranceProcessor</A></H3>
 248 <P STYLE="margin-bottom: 0cm">An UtteranceProcessor is any object
 249 that implements the UtteranceProcessor interface. An
 250 UtteranceProcessor takes as input an Utterance and performs some
 251 operation on the Utterance.
 252 </P>
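<P>As a rough sketch of how a chain of processors cooperates, the
following self-contained example runs two toy processors in sequence,
each adding a named list to a toy utterance. All names here
(<FONT FACE="Courier, sans-serif">ToyUtterance</FONT>,
<FONT FACE="Courier, sans-serif">ToyProcessor</FONT> and the two
processors) are assumptions made for the illustration; the real
interfaces are documented in the FreeTTS API.
</P>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration; these are NOT the FreeTTS classes.
final class PipelineSketch {

    /** A toy utterance: named relations, each a plain list of item names. */
    static final class ToyUtterance {
        final Map<String, List<String>> relations = new LinkedHashMap<>();
    }

    /** Mirrors the idea of an UtteranceProcessor: one operation on an utterance. */
    interface ToyProcessor {
        void process(ToyUtterance utterance);
    }

    /** Hypothetical processor: builds a "Word" relation from the "Token" relation. */
    static final ToyProcessor LOWERCASE_WORDS = u -> {
        List<String> words = new ArrayList<>();
        for (String token : u.relations.get("Token")) {
            words.add(token.toLowerCase());
        }
        u.relations.put("Word", words);
    };

    /** Hypothetical processor: places all words into a single phrase. */
    static final ToyProcessor SINGLE_PHRASE = u -> {
        List<String> phrase = new ArrayList<>();
        phrase.add(String.join(" ", u.relations.get("Word")));
        u.relations.put("Phrase", phrase);
    };

    /** Runs each processor in turn; later processors see earlier results. */
    static ToyUtterance run(List<String> tokens, List<ToyProcessor> processors) {
        ToyUtterance u = new ToyUtterance();
        u.relations.put("Token", new ArrayList<>(tokens));
        for (ToyProcessor p : processors) {
            p.process(u);
        }
        return u;
    }

    public static void main(String[] args) {
        ToyUtterance u = run(Arrays.asList("Hello", "World"),
                Arrays.asList(LOWERCASE_WORDS, SINGLE_PHRASE));
        System.out.println(u.relations);
    }
}
```

<P>Swapping one processor for another changes the behaviour of the
chain without touching the others, which is why replacing an
UtteranceProcessor is the usual way to customize a Voice.
</P>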
 253 <P STYLE="margin-bottom: 0cm"><BR>
 254 </P>
 255 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
 256         <TR>
 257                 <TD BGCOLOR="#eeeeff">
 258                         <H2><A NAME="processing"></A>Processing Walkthrough</H2>
 259                 </TD>
 260         </TR>
 261 </TABLE>
 262 <P>In this section we will describe the detailed processing performed
 263 by the CMUDiphoneVoice. This voice is an unlimited-domain voice that
 264 uses diphone synthesis to generate speech. It is derived from the
 265 CMUVoice class. The CMUVoice describes the general processing
 266 required for an English voice without specifying how unit selection
 267 and concatenation is performed. Subclasses of the CMUVoice
 268 (CMUDiphoneVoice and CMUClusterUnitVoice) provide this
 269 specialization.
 270 </P>
 271 <P>Processing starts with the <FONT FACE="Courier, sans-serif">speak</FONT>
 272 method found in <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.Voice</FONT>.
 273 The <FONT FACE="Courier, sans-serif">speak</FONT> method performs the
 274 following tasks:
 275 <ul>
 276 <li><a href="#Tokenization"> Tokenization </a>
 277 <li><a href="#TokenToWords"> TokenToWords </a>
 278 <li><a href="#PartOfSpeechTagger"> PartOfSpeechTagger </a>
 279 <li><a href="#Phraser"> Phraser </a>
 280 <li><a href="#Segmenter"> Segmenter </a>
 281 <li><a href="#PauseGenerator"> PauseGenerator </a>
 282 <li><a href="#Intonator"> Intonator </a>
 283 <li><a href="#PostLexicalAnalyzer"> PostLexicalAnalyzer </a>
 284 <li><a href="#Durator"> Durator </a>
 285 <li><a href="#ContourGenerator"> ContourGenerator </a>
 286 <li><a href="#UnitSelector"> UnitSelector </a>
 287 <li><a href="#PitchMarkGenerator"> PitchMarkGenerator </a>
 288 <li><a href="#UnitConcatenator"> UnitConcatenator </a>
 289 </ul>
 290 </P>
 291 <H3><a name="Tokenization"> Tokenization</a></H3>
 292 <P>In this step, the Voice uses the Tokenizer as returned from the
 293 <FONT FACE="Courier, sans-serif">getTokenizer</FONT> method to break
 294 a FreeTTSSpeakable object into a series of Utterances. Typically,
 295 tokenization is language-specific so each Voice needs to specify
 296 which Tokenizer is to be used by overriding the <FONT FACE="Courier, sans-serif">getTokenizer</FONT>
 297 method. The CMUDiphoneVoice uses the
 298 <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.en.TokenizerImpl
 299 </FONT>Tokenizer, which is designed to parse and tokenize the English
 300 language.
 301 </P>
 302 <P>A Tokenizer breaks an input stream of text into a series of Tokens
 303 defined by the <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.Token
 304 </FONT>class. Typically, a Token represents a single word in the
 305 input stream. Additionally, a Token will include such information as
 306 the surrounding punctuation and whitespace, and the position of the
 307 token in the input stream.
 308 </P>
 309 <P>The English Tokenizer (<FONT FACE="Courier, sans-serif">com.sun.speech.freetts.en.TokenizerImpl</FONT>)
 310 relies on a set of symbols being defined that specify what characters
 311 are to be considered whitespace and punctuation.
 312 </P>
 313 <P>The Tokenizer defines a method called <FONT FACE="Courier, sans-serif">isBreak</FONT>
 314 that is used to determine when the input stream should be broken and
 315 a new Utterance is generated. For example, the English Tokenizer has
 316 a set of rules to detect an end of sentence. If the current token
 317 should start a new sentence, then the English Tokenizer <FONT FACE="Courier, sans-serif">isBreak</FONT>
 318 method returns true.
 319 </P>
 320 <P>A higher-level Tokenizer, FreeTTSSpeakableTokenizer, repeatedly
 321 calls the English Tokenizer and places each token into a list. When
 322 the Tokenizer <FONT FACE="Courier, sans-serif">isBreak</FONT> method
 323 indicates that a sentence break has occurred, the Voice creates a new
 324 Utterance with the current list of tokens. The process of generating
 325 and processing Utterances continues until no more tokens remain in
 326 the input.
 327 </P>
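<P>The grouping of tokens into Utterances can be sketched as follows.
This is a deliberately simplified illustration: the
<FONT FACE="Courier, sans-serif">endsSentence</FONT> rule (a token
ending in a period, exclamation mark or question mark closes the
sentence) is an assumption standing in for the richer end-of-sentence
rules of the English Tokenizer, and the class and method names are
hypothetical.
</P>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the real tokenizer has richer end-of-sentence rules.
final class UtteranceSplitSketch {

    /** Assumed rule: a token ending in '.', '!' or '?' ends the sentence. */
    static boolean endsSentence(String token) {
        return token.endsWith(".") || token.endsWith("!") || token.endsWith("?");
    }

    /** Groups whitespace-separated tokens into one token list per utterance. */
    static List<List<String>> split(String text) {
        List<List<String>> utterances = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input produces a single empty token
            }
            current.add(token);
            if (endsSentence(token)) {
                utterances.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            utterances.add(current); // trailing tokens without final punctuation
        }
        return utterances;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello world. How are you? Fine"));
    }
}
```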
 328 <H2 ALIGN=CENTER>Figure 1: The Utterance after Tokenization
 329 </H2>
 330 <P><IMG SRC="images/img0.jpg" NAME="Graphic1" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 331 </P>
 332 <H3>Utterance Processing
 333 </H3>
 334 <P>A Voice maintains a list of UtteranceProcessors. Each Utterance
 335 generated by the tokenization step is run through the
 336 UtteranceProcessors for the Voice. Each processor receives as input
 337 the Utterance that is being processed. The UtteranceProcessor may add
 338 new Relations to the Utterance, add new Items to Relations, or add
 339 new FeatureSets to Items or to the Utterance itself. Often, a
 340 series of UtteranceProcessors is tightly coupled; one
 341 UtteranceProcessor may add a Relation to an Utterance that is used
 342 by the next.
 343 </P>
 344 <P>CMUVoice sets up most of the UtteranceProcessors used by
 345 CMUDiphoneVoice. CMUVoice provides a number of <FONT FACE="Courier, sans-serif">getXXX</FONT>
 346 methods that return an UtteranceProcessor, such as <FONT FACE="Courier, sans-serif">getUnitSelector</FONT>
 347 and <FONT FACE="Courier, sans-serif">getUnitConcatenator</FONT>.
 348 Sub-classes of CMUVoice override these <FONT FACE="Courier, sans-serif">getXXX</FONT>
 349 methods to customize the processing. For instance, the
 350 CMUDiphoneVoice overrides <FONT FACE="Courier, sans-serif">getUnitSelector
 351 </FONT>to return a DiphoneUnitSelector.
 352 </P>
 353
 354 <H3>CMUDiphoneVoice Utterance Processing
 355 </H3>
 356 <P>The UtteranceProcessors described in this section are invoked when
 357 the CMUDiphoneVoice processes an Utterance. When processing begins
 358 the Utterance contains the token list and FeatureSets.
 359 </P>
 360 <H4><a name="TokenToWords"> TokenToWords </a>
 361 </H4>
 362 <P>The TokenToWords UtteranceProcessor creates a word Relation from
 363 the token Relation by iterating through the token Relation Item list
 364 and creating one or more words for each token. For most tokens there
 365 is a one-to-one relationship between words and tokens, in which case
 366 a single word Item is generated for the token Item. Other tokens,
 367 such as &quot;2001&quot;, generate multiple words (&quot;two thousand
 368 one&quot;). Each word is created as an Item and added to the word
 369 Relation. Additionally, each word Item is added as a daughter to the
 370 corresponding token in the token Relation.
 371 </P>
 372 <P>The main role of TokenToWords is to look for various forms of
 373 numbers and convert them into the corresponding English words.
 374 TokenToWords looks for simple digit strings, comma separated numerals
 375 (such as 1,234,567), ordinal values, years, floating point values,
 376 and exponential notation. TokenToWords uses the JDK 1.4 regular
 377 expression API to perform some classification. In addition, a CART
 378 (Classification and Regression Tree) is used to classify numbers as
 379 one of: year, ordinal, cardinal, digits. Refer to <A HREF="#cart">Classification
 380 and Regression Trees</A> for more information on CARTs.
 381 </P>
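<P>To make the number expansion concrete, here is a minimal sketch of
the cardinal case only. It is not the TokenToWords implementation: the
class name, the regular expression and the supported range (below one
billion) are assumptions chosen for the example, and years, ordinals,
floating point values and exponents are left out.
</P>

```java
// Illustrative sketch of cardinal number expansion; not the FreeTTS code.
final class NumberWordsSketch {

    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"
    };
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"
    };

    /** Expands a digit string or comma-separated numeral; other tokens pass through. */
    static String toWords(String token) {
        if (token.matches("\\d{1,3}(,\\d{3})+|\\d+")) {
            return cardinal(token.replace(",", ""));
        }
        return token;
    }

    /** Expands a non-negative digit string below one billion into words. */
    static String cardinal(String digits) {
        int n = Integer.parseInt(digits); // throws NumberFormatException if too large
        if (n < 0 || n >= 1_000_000_000) {
            throw new IllegalArgumentException("out of range: " + digits);
        }
        if (n == 0) {
            return ONES[0];
        }
        StringBuilder sb = new StringBuilder();
        if (n >= 1_000_000) {
            appendBelowThousand(sb, n / 1_000_000);
            append(sb, "million");
            n %= 1_000_000;
        }
        if (n >= 1000) {
            appendBelowThousand(sb, n / 1000);
            append(sb, "thousand");
            n %= 1000;
        }
        if (n > 0) {
            appendBelowThousand(sb, n);
        }
        return sb.toString();
    }

    /** Appends the words for a value in the range 1..999. */
    private static void appendBelowThousand(StringBuilder sb, int n) {
        if (n >= 100) {
            append(sb, ONES[n / 100]);
            append(sb, "hundred");
            n %= 100;
        }
        if (n >= 20) {
            append(sb, TENS[n / 10]);
            n %= 10;
        }
        if (n > 0) {
            append(sb, ONES[n]);
        }
    }

    private static void append(StringBuilder sb, String word) {
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(word);
    }

    public static void main(String[] args) {
        System.out.println(toWords("2001"));
        System.out.println(toWords("1,234,567"));
    }
}
```

<P>Each word of the returned string would become its own Item in the
word Relation, attached as a daughter of the originating token.
</P>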
 382 <H2 ALIGN=CENTER>Figure 2: The Utterance after TokenToWords
 383 </H2>
 384 <P><IMG SRC="images/img1.jpg" NAME="Graphic2" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 385 </P>
 386 <H4><a name="PartOfSpeechTagger"> PartOfSpeechTagger </a>
 387 </H4>
 388 <P>The PartOfSpeechTagger UtteranceProcessor is a place-holder
 389 processor that currently does nothing.
 390 </P>
 391 <H2 ALIGN=CENTER>Figure 3: The Utterance after PartOfSpeechTagger
 392 </H2>
 393 <P><IMG SRC="images/img2.jpg" NAME="Graphic3" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 394 </P>
 395 <H4><A NAME="Phraser"></A>Phraser
 396 </H4>
 397 <P>The Phraser processor creates a phrase Relation in the Utterance.
 398 The phrase Relation represents how the Utterance is to be broken into
 399 phrases when spoken. The phrase Relation consists of an Item marking
 400 the beginning of each phrase in the Utterance. This phrase Item has
 401 as its daughters the list of words that are part of the phrase.
 402 </P>
 403 <P>The Phraser builds the phrase Relation by iterating through the
 404 Word Relation created by the TokenToWords processor. The Phraser uses
 405 a Phrasing CART to determine where the phrase breaks occur and
 406 creates the phrase Items accordingly.
 407 </P>
 408 <H2 ALIGN=CENTER>Figure 4: The Utterance after Phraser Processing</H2>
 409 <P><IMG SRC="images/img3.jpg" NAME="Graphic4" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 410 </P>
 411 <H4><a name="Segmenter"> Segmenter </a>
 412 </H4>
 413 <P>The Segmenter is one of the more complex UtteranceProcessors. It
 414 is responsible for determining where syllable breaks occur in the
 415 Utterance. It organizes this information in several new Relations in
 416 the Utterance.
 417 </P>
 418 <P>The Segmenter iterates through each word in the Utterance. For
 419 each word, the Segmenter performs the following steps:
 420 </P>
 421 <UL>
 422         <LI><P STYLE="margin-bottom: 0cm">Retrieves the phones that are
 423         associated with the word from the <A HREF="#lexicon">Lexicon </A>.
 424         Each word is organized in a Relation called &quot;SylStructure&quot;.
 425                 </P>
 426         <LI><P STYLE="margin-bottom: 0cm">Iterates through each phone of the
 427         word, adding the phone to a Relation called &quot;Segment&quot;.
 428         </P>
 429         <LI><P STYLE="margin-bottom: 0cm">Determines where syllable breaks
 430         occur (with help from the lexicon) and notes the syllable break
 431         points in a Relation called &quot;Syllable&quot;
 432         </P>
 433         <LI><P>If the lexicon indicates that a particular phone is stressed,
 434         then the syllable that contains that phone is marked as &quot;stressed&quot;.
 435                 </P>
 436 </UL>
 437 <P>When the Segmenter is finished, three new Relations have been
 438 added to the Utterance that denote the syllable structure and units
 439 for the Utterance.
 440 </P>
 441 <H2 ALIGN=CENTER>Figure 5: The Utterance after Segmenter Processing
 442 </H2>
 443 <P><IMG SRC="images/img4.jpg" NAME="Graphic5" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 444 </P>
 445 <H4><a name="PauseGenerator"> PauseGenerator </a>
 446 </H4>
 447 <P>The PauseGenerator annotates an Utterance with pause information.
 448 It inserts a pause at the beginning of the segment list (thus all
 449 Utterances start with a pause). It then iterates through the phrase
 450 Relation (set up by the <A HREF="#Phraser">Phraser</A>) and inserts a
 451 pause before the first segment of each phrase.
 452 </P>
 453 <H2 ALIGN=CENTER>Figure 6: The Utterance after PauseGenerator
 454 Processing
 455 </H2>
 456 <P><IMG SRC="images/img5.jpg" NAME="Graphic6" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 457 </P>
 458 <H4><a name="Intonator"> Intonator </a>
 459 </H4>
 460 <P>The Intonator processor annotates the syllable Relation of an
 461 Utterance with &quot;accent&quot; and &quot;endtone&quot; features.
 462 A typical application of this uses the ToBI (tones and break indices)
 463 scheme for transcribing intonation and accent in English, developed
 464 by Janet Pierrehumbert and Mary Beckman.
 465 </P>
 466 <P>The intonation is independent of the ToBI annotation: ToBI
 467 annotations are not used by this class, but are merely copied from
 468 the CART result to the &quot;accent&quot; and &quot;endtone&quot;
 469 features of the syllable Relation.
 470 </P>
 471 <P>This processor relies on two <A HREF="#cart">CARTs </A>: an accent
 472 CART and a tone CART. This processor iterates through each syllable
 473 in the syllable relation, applies each CART to the syllable and sets
 474 the accent and endtone features of the Item based upon the results of
 475 the CART processing.
 476 </P>
 477 <H2 ALIGN=CENTER>Figure 7: The Utterance after Intonator Processing
 478 </H2>
 479 <P><IMG SRC="images/img6.jpg" NAME="Graphic7" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 480 </P>
 481 <H4><a name="PostLexicalAnalyzer"> PostLexicalAnalyzer </a>
 482 </H4>
 483 <P>The PostLexicalAnalyzer is responsible for performing any fix-ups
 484 before the next phase of processing. For instance, the
 485 CMUDiphoneVoice provides a PostLexicalAnalyzer that performs two
 486 functions:
 487 </P>
 488 <UL>
 489         <LI><P STYLE="margin-bottom: 0cm"><B>Fix AH </B>The diphone data for
 490         the CMUDiphoneVoice does not include the &quot;ah&quot;
 491         phone. The CMU Lexicon that is used by the CMUDiphoneVoice,
 492         however, contains a number of words that reference &quot;ah&quot;.
 493         The CMUDiphoneVoice PostLexicalAnalyzer iterates through
 494         all phones in the segment Relation and replaces each &quot;ah&quot;
 495         with &quot;aa&quot;.
 496         </P>
 497         <LI><P><B>Fix Apostrophe-S </B>This step iterates through the
 498         segments and looks for words associated with the segments that
 499         contain an apostrophe-s. The processor then inserts a 'schwa'
 500         phoneme in certain cases.
 501         </P>
 502 </UL>
 503 <H4><a name="Durator"> Durator </a>
 504 </H4>
 505 <P>The Durator is responsible for determining the ending time for
 506 each unit in the segment list. The Durator uses a CART to look up the
 507 statistical average duration and standard deviation for each phone
 508 and calculates an exact duration based upon the CART derived
 509 adjustment. Each unit is finally tagged with an &quot;end&quot;
 510 attribute that indicates the time, in seconds, at which the unit
 511 should be completed.
 512 </P>
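<P>The text above does not spell out the formula, so the following is
only a sketch under a stated assumption: that the CART yields a
z-score style adjustment which is scaled by the phone's standard
deviation and added to its mean duration, as is commonly done in
Flite-style duration models. The class and method names are
hypothetical. The &quot;end&quot; value of each unit is then the
running sum of the durations.
</P>

```java
// Illustrative sketch; the z-score formula is an assumption, not taken from the text.
final class DurationSketch {

    /** Duration in seconds: mean plus the (assumed) CART z-score times the deviation. */
    static double duration(double mean, double stddev, double zScore) {
        return mean + zScore * stddev;
    }

    /** Cumulative end times in seconds for a sequence of segment durations. */
    static double[] endTimes(double[] durations) {
        double[] ends = new double[durations.length];
        double t = 0.0;
        for (int i = 0; i < durations.length; i++) {
            t += durations[i];
            ends[i] = t; // value that would be stored as the "end" attribute
        }
        return ends;
    }

    public static void main(String[] args) {
        double[] durations = {
            duration(0.10, 0.02, 1.0),
            duration(0.08, 0.01, -0.5),
            duration(0.12, 0.03, 0.0)
        };
        for (double end : endTimes(durations)) {
            System.out.printf("%.3f%n", end);
        }
    }
}
```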
 513 <H2 ALIGN=CENTER>Figure 8: The Utterance after Durator Processing
 514 </H2>
 515 <P><IMG SRC="images/img7.jpg" NAME="Graphic8" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 516 </P>
 517 <H4><a name="ContourGenerator"> ContourGenerator </a>
 518 </H4>
 519 <P>The ContourGenerator is responsible for calculating the F0
 520 (Fundamental Frequency) curve for an Utterance. The paper: <A HREF="http://citeseer.nj.nec.com/20262.html">Generating
 521 F0 contours from ToBI labels using linear regression </A>by Alan W.
 522 Black, Andrew J. Hunt, describes the techniques used.
 523 </P>
 524 <P>The ContourGenerator creates the &quot;target&quot; Relation and
 525 populates it with target points that mark the time and target
 526 frequency for each segment. The ContourGenerator is driven by a
 527 file of feature model terms. For example, CMUDiphoneVoice uses
 528 com/sun/speech/freetts/en/us/f0_lr_terms.txt. Here is an excerpt:
 529 </P>
 530 <PRE>Intercept 160.584961 169.183380 169.570374 null
 531 p.p.accent 10.081770 4.923247 3.594771 H*
 532 p.p.accent 3.358613 0.955474 0.432519 !H*
 533 p.p.accent 4.144342 1.193597 0.235664 L+H*
 534 p.accent 32.081028 16.603350 11.214208 H*
 535 p.accent 18.090033 11.665814 9.619350 !H*
 536 p.accent 23.255280 13.063298 9.084690 L+H*
 537 accent 5.221081 34.517868 25.217588 H*
 538 accent 10.159194 22.349655 13.759851 !H*
 539 accent 3.645511 23.551548 17.635193 L+H*
 540 n.accent -5.691933 -1.914945 4.944848 H*
 541 n.accent 8.265606 5.249441 7.398383 !H*
 542 n.accent 0.861427 -1.929947 1.683011 L+H*
 543 n.n.accent -3.785701 -6.147251 -4.335797 H*</PRE><P>
 544 The first column represents the feature name. It is followed by the
 545 starting point, the mid-point and the ending point for the term (in
 546 terms of relative frequency deltas). The final column represents the
 547 ToBI label.
 548 </P>
 549 <P>The ContourGenerator iterates through each syllable in the
 550 Utterance and applies the linear regression model as follows:
 551 </P>
 552 <UL>
 553         <LI><P STYLE="margin-bottom: 0cm">For each entry in the
 554         feature/model/terms table, extract the named feature.
 555         </P>
 556         <LI><P STYLE="margin-bottom: 0cm">Compare the feature value to the
 557         ToBI label as specified in the table.
 558         </P>
 559         <LI><P STYLE="margin-bottom: 0cm">If the features match, then use
 560         the start/midpoint and end to update the curve.
 561         </P>
 562         <LI><P>Add the new target point to the target Relation
 563         </P>
 564 </UL>
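<P>The steps above can be sketched with a few rows of the excerpt. The
sketch assumes that the Intercept row always applies and that every
other row contributes its three values only when the named feature
equals the row's ToBI label; the handling of rows without a label and
any later scaling of the result to a speaker's pitch range are
omitted. The class and field names are hypothetical.
</P>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of applying the feature/model/terms table to one syllable.
final class ContourSketch {

    /** One row of the table: feature name, three coefficients, ToBI label. */
    static final class Term {
        final String feature;
        final double start;
        final double mid;
        final double end;
        final String label;

        Term(String feature, double start, double mid, double end, String label) {
            this.feature = feature;
            this.start = start;
            this.mid = mid;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns {start, mid, end} values for a syllable with the given features. */
    static double[] predict(List<Term> terms, Map<String, String> features) {
        double[] f0 = new double[3];
        for (Term t : terms) {
            boolean applies = t.feature.equals("Intercept")
                    || t.label.equals(features.get(t.feature));
            if (applies) {
                f0[0] += t.start;
                f0[1] += t.mid;
                f0[2] += t.end;
            }
        }
        return f0;
    }

    public static void main(String[] args) {
        List<Term> terms = Arrays.asList(
                new Term("Intercept", 160.584961, 169.183380, 169.570374, "null"),
                new Term("accent", 5.221081, 34.517868, 25.217588, "H*"),
                new Term("n.accent", -5.691933, -1.914945, 4.944848, "H*"));
        Map<String, String> syllable = new HashMap<>();
        syllable.put("accent", "H*"); // this syllable carries an H* accent
        System.out.println(Arrays.toString(predict(terms, syllable)));
    }
}
```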
 565 <H2 ALIGN=CENTER>Figure 9: The Utterance after ContourGenerator
 566 Processing
 567 </H2>
 568 <P><IMG SRC="images/img8.jpg" NAME="Graphic9" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 569 </P>
 570 <H4><a name="UnitSelector"> UnitSelector </a>
 571 </H4>
 572 <P>The UnitSelector that is used by the CMUDiphoneVoice creates a
 573 Relation in the Utterance called &quot;unit&quot;. This relation
 574 contains Items that represent the diphones for the unit. This
 575 processor iterates through the segment list and builds up diphone
 576 names by assembling two adjacent phone names. The diphone is added to
 577 the unit Relation along with timing information about the diphone.
 578 </P>
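<P>Building diphone names from adjacent phones might look like the
sketch below. The hyphen used to join the two phone names and the
example phone sequence are assumptions made for the illustration, and
the timing information mentioned above is left out.
</P>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch; the "-" separator is an assumed naming convention.
final class DiphoneNameSketch {

    /** Pairs each phone with its successor to form a diphone name. */
    static List<String> diphones(List<String> phones) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i + 1 < phones.size(); i++) {
            names.add(phones.get(i) + "-" + phones.get(i + 1));
        }
        return names;
    }

    public static void main(String[] args) {
        // A hypothetical segment list with leading and trailing pauses.
        System.out.println(diphones(Arrays.asList("pau", "hh", "ax", "l", "ow", "pau")));
    }
}
```

<P>A segment list of <I>n</I> phones therefore yields <I>n</I> - 1
diphones.
</P>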
 579 <H2 ALIGN=CENTER>Figure 10: The Utterance after UnitSelector
 580 Processing
 581 </H2>
 582 <P><IMG SRC="images/img9.jpg" NAME="Graphic10" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 583 </P>
 584 <H4><a name="PitchMarkGenerator"> PitchMarkGenerator </a>
 585 </H4>
 586 <P>The PitchMarkGenerator is responsible for calculating pitchmarks
 587 for the Utterance. The pitchmarks are generated by iterating through
 588 the target Relation and calculating a slope based upon the desired
 589 time and F0 values for each Item in the target Relation. The
 590 resulting slope is used to calculate a series of target times for
 591 each pitchmark. These target times are stored in an LPCResult object
 592 that is added to the Utterance.
 593 </P>
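<P>One way to picture the calculation is the sketch below, which
interpolates F0 linearly between consecutive targets and places each
pitchmark one pitch period (1/F0 seconds) after the previous one. This
is a simplified reading of the description above, not the
PitchMarkGenerator code; the class name and the plain arrays used in
place of the target Relation and LPCResult are assumptions.
</P>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of pitchmark placement from (time, F0) targets.
final class PitchmarkSketch {

    /**
     * Returns pitchmark times in seconds. The times array must be strictly
     * increasing and every f0 value (in Hz) must be positive.
     */
    static List<Double> pitchmarks(double[] times, double[] f0) {
        if (times.length != f0.length) {
            throw new IllegalArgumentException("times and f0 must have equal length");
        }
        List<Double> marks = new ArrayList<>();
        if (times.length == 0) {
            return marks;
        }
        double t = times[0];
        for (int i = 0; i + 1 < times.length; i++) {
            // Slope of the F0 line between target i and target i + 1.
            double slope = (f0[i + 1] - f0[i]) / (times[i + 1] - times[i]);
            while (t < times[i + 1]) {
                double current = f0[i] + slope * (t - times[i]);
                t += 1.0 / current; // advance by one pitch period
                marks.add(t);
            }
        }
        return marks;
    }

    public static void main(String[] args) {
        // F0 rises from 100 Hz to 120 Hz over 0.2 s, then falls back to 100 Hz.
        double[] times = {0.0, 0.2, 0.4};
        double[] f0 = {100.0, 120.0, 100.0};
        System.out.println(pitchmarks(times, f0).size() + " pitchmarks");
    }
}
```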
 594 <P><IMG SRC="images/img10.jpg" NAME="Graphic11" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 595 </P>
 596 <H4><a name="UnitConcatenator"> UnitConcatenator </a>
 597 </H4>
 598 <P>The UnitConcatenator processor is responsible for gathering all of
 599 the diphone data and joining it together. For each Item in the unit
 600 Relation (recall this was the set of diphones) the UnitConcatenator
 601 extracts the unit sample data from the unit based upon the target
 602 times as stored in the LPC result.
 603 </P>
 604 <H2 ALIGN=CENTER>Figure 11: The Utterance after UnitConcatenator
 605 Processing
 606 </H2>
 607 <P STYLE="margin-bottom: 0cm"><IMG SRC="images/img11.jpg" NAME="Graphic11" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0>
 608 </P>
 609 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
 610         <TR>
 611                 <TD BGCOLOR="#eeeeff">
 612                         <H2><A NAME="data"></A>FreeTTS Data
 613                         </H2>
 614                 </TD>
 615         </TR>
 616 </TABLE>
<P>This section describes the main data structures used by FreeTTS:
CARTs, the Lexicon, the Letter-To-Sound rules and the unit selection data.
 619 </P>
 620 <H3><A NAME="cart"></A>Classification and Regression Trees (CART)
 621 </H3>
 622 <P>The use of Classification and Regression Trees is described in the
 623 paper by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone.
 624 <A HREF="http://citeseer.nj.nec.com/context/7119/0"><I>Classification
 625 and Regression Trees.</I></A> Additional information about how such
 626 trees can be used in the context of speech synthesis is described in
 627 <A HREF="http://festvox.org/docs/speech_tools-1.2.0/c16616.htm">Chapter
 628 10 </A>of the System Documentation of the <I>Edinburgh Speech Tools
 629 Library</I>. The Classification and Regression Trees (CART) in FreeTTS
 630 are essentially binary decision trees used to classify some part of
 631 an Utterance.
 632 </P>
 633 <P>A CART is a tree of nodes and leaves. Each node consists of the
 634 following:
 635 </P>
 636 <UL>
 637         <LI><P><B>Feature </B>The feature to test. This is in the form of a
 638         feature traversal string. For instance the feature string:
 639         </P>
 640         <PRE STYLE="margin-bottom: 0.5cm">&quot;R:SylStructure.daughter.R:Segment.p.end&quot;</PRE><P STYLE="margin-bottom: 0cm">
 641         Can be interpreted as:<BR><BR>
 642         </P>
 643 </UL>
 644 <P STYLE="margin-left: 2cm; margin-bottom: 0cm"><I>Given an Item in
 645 the syllable relation (syl), find the SylStructure Relation in that
 646 syllable, get the first daughter, find the segment associated with
 647 the daughter, find the previous segment and return its &quot;end
 648 time&quot;.</I> <BR><BR>
 649 </P>
 650 <UL>
 651         <LI><P STYLE="margin-bottom: 0cm"><B>Operand </B>The type of test to
 652         perform. The available operands are:
 653         </P>
 654         <UL>
 655                 <LI><P STYLE="margin-bottom: 0cm">LESS_THAN - The feature is less
 656                 than the value
 657                 </P>
 658                 <LI><P STYLE="margin-bottom: 0cm">EQUAL - The feature is equal to
 659                 the value
 660                 </P>
 661                 <LI><P STYLE="margin-bottom: 0cm">GREATER_THAN - The feature is
 662                 greater than the value
 663                 </P>
 664                 <LI><P STYLE="margin-bottom: 0cm">MATCHES - The feature matches the
 665                 regular expression stored in the value
 666                 </P>
 667         </UL>
 668         <LI><P STYLE="margin-bottom: 0cm"><B>Value </B>The feature value is
 669         compared based on the operand to this value.
 670         </P>
 671         <LI><P STYLE="margin-bottom: 0cm"><B>Success Node </B>If the
 672         comparison is successful, tree traversal continues at this node.
 673         </P>
 674         <LI><P STYLE="margin-bottom: 0cm"><B>Failure Node </B>If the
 675         comparison fails, traversal continues at this node.
 676         </P>
 677         <LI><P><B>Type </B>A node can be of two types, a NODE or a LEAF. A
 678         NODE is a non-terminal member of the tree, whereas a LEAF is a
 679         terminal node. Once the interpretation of a CART reaches a LEAF
 680         node, the value for that node is returned.
 681         </P>
 682 </UL>
<P>Typically, an UtteranceProcessor will employ a CART to
classify a particular Item or part of an Utterance. The CART
processing proceeds as follows:
 686 </P>
 687 <UL>
 688         <LI><P STYLE="margin-bottom: 0cm">Starting at the first node in the
 689         CART, extract the feature pointed to by the node.
 690         </P>
        <LI><P STYLE="margin-bottom: 0cm">Compare the extracted feature to
        the node's value using the node's operand. If the comparison succeeds,
        proceed to the Success Node; otherwise go to the Failure Node.
 694         </P>
 695         <LI><P>Continue processing nodes in this fashion until a LEAF node
 696         is reached, at which point return the value of that node.
 697         </P>
 698 </UL>
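<P>The traversal steps above can be sketched as a small interpreter.
This is an illustration only: the class, the feature names
(&quot;syl_break&quot;, &quot;stress&quot;) and the example tree are
hypothetical, features are looked up in a plain map rather than through
a feature traversal string, and a missing feature is treated as
&quot;0&quot; by assumption.
</P>

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Minimal CART sketch; all names are hypothetical, not the FreeTTS API. */
class CartSketch {
    enum Op { LESS_THAN, EQUAL, GREATER_THAN, MATCHES }

    /** A decision NODE, or a LEAF when op is null. */
    static final class Node {
        final String feature;  // feature to test (a plain map key in this sketch)
        final Op op;           // type of test; null marks a LEAF
        final String value;    // comparison value, or the result for a LEAF
        final int success;     // index of the next node when the test succeeds
        final int failure;     // index of the next node when the test fails

        Node(String feature, Op op, String value, int success, int failure) {
            this.feature = feature;
            this.op = op;
            this.value = value;
            this.success = success;
            this.failure = failure;
        }

        static Node leaf(String result) {
            return new Node(null, null, result, -1, -1);
        }

        boolean isLeaf() {
            return op == null;
        }
    }

    /** Applies one operand to the extracted feature and the node value. */
    static boolean test(Op op, String actual, String value) {
        switch (op) {
            case LESS_THAN:
                return Double.parseDouble(actual) < Double.parseDouble(value);
            case GREATER_THAN:
                return Double.parseDouble(actual) > Double.parseDouble(value);
            case MATCHES:
                return Pattern.matches(value, actual);
            default:
                return actual.equals(value);
        }
    }

    /** Starts at node 0 and follows success/failure links until a LEAF. */
    static String interpret(Node[] cart, Map<String, String> features) {
        int i = 0;
        while (!cart[i].isLeaf()) {
            Node node = cart[i];
            String actual = features.getOrDefault(node.feature, "0");
            i = test(node.op, actual, node.value) ? node.success : node.failure;
        }
        return cart[i].value;
    }

    /** A made-up five-entry tree used only to exercise the interpreter. */
    static Node[] exampleCart() {
        return new Node[] {
            new Node("syl_break", Op.GREATER_THAN, "1", 1, 2),
            Node.leaf("BB"),
            new Node("stress", Op.EQUAL, "1", 3, 4),
            Node.leaf("accented"),
            Node.leaf("NONE")
        };
    }
}
```

<P>Storing the success and failure links as array indices keeps the
tree flat, which suits trees that are loaded from a text file one node
per line.
</P>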
 699 <CENTER>
 700         <TABLE WIDTH=100% BORDER=1 CELLPADDING=2 CELLSPACING=2>
 701                 <TR>
 702                         <TH COLSPAN=4>
 703                                 <P>CARTS used by FreeTTS
 704                                 </P>
 705                         </TH>
 706                 </TR>
 707                 <TR>
 708                         <TH>
 709                                 <P>Name
 710                                 </P>
 711                         </TH>
 712                         <TH>
 713                                 <P>Location
 714                                 </P>
 715                         </TH>
 716                         <TH>
 717                                 <P># Nodes
 718                                 </P>
 719                         </TH>
 720                         <TH>
 721                                 <P>Description
 722                                 </P>
 723                         </TH>
 724                 </TR>
 725                 <TR>
 726                         <TH>
 727                                 <P>Phraser
 728                                 </P>
 729                         </TH>
 730                         <TD>
                                <P>en/us/phrasing_cart.txt
 732                                 </P>
 733                         </TD>
 734                         <TD>
 735                                 <P>40 nodes
 736                                 </P>
 737                         </TD>
 738                         <TD>
 739                                 <P>used to determine where to place breaks in phrases.
 740                                 </P>
 741                         </TD>
 742                 </TR>
 743                 <TR>
 744                         <TH>
 745                                 <P>Accent
 746                                 </P>
 747                         </TH>
 748                         <TD>
 749                                 <P>en/us/int_accent_cart.txt
 750                                 </P>
 751                         </TD>
 752                         <TD>
 753                                 <P>150 nodes
 754                                 </P>
 755                         </TD>
 756                         <TD>
                                <P>used to determine where to apply syllable accents.
 758                                 </P>
 759                         </TD>
 760                 </TR>
 761                 <TR>
 762                         <TH>
 763                                 <P>Tone
 764                                 </P>
 765                         </TH>
 766                         <TD>
 767                                 <P>en/us/int_tone_cart.txt
 768                                 </P>
 769                         </TD>
 770                         <TD>
 771                                 <P>100 nodes
 772                                 </P>
 773                         </TD>
 774                         <TD>
 775                                 <P>used to determine the type of 'end tone' for syllables.
 776                                 </P>
 777                         </TD>
 778                 </TR>
 779                 <TR>
 780                         <TH>
 781                                 <P>Duration
 782                                 </P>
 783                         </TH>
 784                         <TD>
                                <P>en/us/durz_cart.txt
 786                                 </P>
 787                         </TD>
 788                         <TD>
 789                                 <P>800 nodes
 790                                 </P>
 791                         </TD>
 792                         <TD>
 793                                 <P>used to determine the duration for each segment of an
 794                                 Utterance.
 795                                 </P>
 796                         </TD>
 797                 </TR>
 798                 <TR>
 799                         <TH>
 800                                 <P>TokenToWords
 801                                 </P>
 802                         </TH>
 803                         <TD>
 804                                 <P>en/us/nums_cart.txt
 805                                 </P>
 806                         </TD>
 807                         <TD>
 808                                 <P>100 nodes
 809                                 </P>
 810                         </TD>
 811                         <TD>
 812                                 <P>used to classify numbers as cardinal, digits, ordinal or year.
 813                                                                 </P>
 814                         </TD>
 815                 </TR>
 816                 <TR>
 817                         <TH>
 818                                 <P>ClusterUnitSelection
 819                                 </P>
 820                         </TH>
 821                         <TD>
 822                                 <P>en/us/cmu_awb/cmu_time_awb.txt
 823                                 </P>
 824                         </TD>
 825                         <TD>
 826                                 <P>130 CARTS, 2 nodes each
 827                                 </P>
 828                         </TD>
 829                         <TD>
                                <P>The cluster unit database contains roughly 130 separate CARTs,
                                each of which contains only a couple of nodes. These CARTs are
                                used to select phoneme units.
 833                                 </P>
 834                         </TD>
 835                 </TR>
 836         </TABLE>
 837 </CENTER>
 838 <H3><a name="lexicon">Lexicon</a>
 839 </H3>
<P>The Lexicon provides a mapping of words to their pronunciations.
FreeTTS provides a generic lexicon interface
(<FONT FACE="Courier, sans-serif">com.sun.speech.freetts.lexicon</FONT>)
and a specific implementation, <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.en.us.CMULexicon</FONT>,
that provides an English language lexicon based upon CMU data. The
essential function of a Lexicon is to determine the pronunciation of
a word. The retrieval is done via the
<FONT FACE="Courier, sans-serif">Lexicon.getPhones</FONT> method.
 848 </P>
 849 <P>The Lexicon interface provides the ability to add new words to the
 850 Lexicon.
 851 </P>
 852 <P>The CMULexicon is an implementation of the Lexicon interface that
 853 supports the Flite CMU Lexicon. The CMULexicon contains over 60,000
 854 pronunciations. Here is a snippet:
 855 </P>
 856 <PRE>abbasi0 aa b aa1 s iy
 857 abbate0 aa1 b ey t
 858 abbatiello0     aa b aa t iy eh1 l ow
 859 abbe0   ae1 b iy
 860 abbett0 ax b eh1 t
 861 abbie0  ae1 b iy
 862 abbitt0 ae1 b ax t
 863 abbot0  ae1 b ax t
 864 abboud0 ax b uw d
 865 abbreviate0     ax b r iy1 v iy ey1 t
 866 abbruzzese0     aa b r uw t s ey1 z iy
 867 abbs0   ae1 b z
 868 abby0   ae1 b iy
 869 abco0   ae1 b k ow
 870 abdel0  ae1 b d eh1 l
 871 abdicating0     ae1 b d ih k ey1 t ih ng </PRE><P>
Each entry contains a word with a part-of-speech tag appended to it
(the trailing &quot;0&quot; in the entries above), followed by the phones
representing the pronunciation. A separate file maintains the addenda,
a smaller set of pronunciations typically used to provide custom,
application-specific or domain-specific pronunciations.
 877 </P>
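<P>As an illustration of the entry layout, the sketch below splits one
line of the text form into its word, part-of-speech tag and phones. The
class and field names are hypothetical, and the sketch assumes that the
tag is always the single last character of the first field, as in the
snippet above.
</P>

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser for one line of the text lexicon shown above. */
class LexiconEntrySketch {
    final String word;
    final char partOfSpeech;
    final List<String> phones;

    LexiconEntrySketch(String word, char partOfSpeech, List<String> phones) {
        this.word = word;
        this.partOfSpeech = partOfSpeech;
        this.phones = phones;
    }

    static LexiconEntrySketch parse(String line) {
        // Fields are separated by runs of spaces or tabs.
        String[] fields = line.trim().split("\\s+");
        String key = fields[0];
        // Assumption: the last character of the key is the part-of-speech tag.
        String word = key.substring(0, key.length() - 1);
        char pos = key.charAt(key.length() - 1);
        List<String> phones =
            Arrays.asList(Arrays.copyOfRange(fields, 1, fields.length));
        return new LexiconEntrySketch(word, pos, phones);
    }
}
```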
<P>The CMULexicon implementation also relies on a set of
Letter-To-Sound rules. These rules can automatically determine the
pronunciation of a word. When the pronunciation of a word is
requested, the CMULexicon will first look it up in the main list of
words. If it is not found, the addenda is checked. If the word is
still not found, then the Letter-To-Sound rules are used to convert
the word into phones. To conserve space, the CMULexicon has been
stripped of all words that can be recreated using the Letter-To-Sound
rules. One can look at the 60,000 pronunciations in the Lexicon as
exceptions to the rules.
 888 </P>
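<P>The lookup order described above can be sketched as a chain of
fallbacks. The class below is not the CMULexicon: the map fields, the
key format (word plus tag) and the functional Letter-To-Sound hook are
assumptions chosen to keep the example self-contained.
</P>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical three-stage lookup: main list, addenda, then LTS rules. */
class LexiconLookupSketch {
    final Map<String, String[]> mainList = new HashMap<>();
    final Map<String, String[]> addenda = new HashMap<>();
    final Function<String, String[]> letterToSound;

    LexiconLookupSketch(Function<String, String[]> letterToSound) {
        this.letterToSound = letterToSound;
    }

    String[] getPhones(String word, String partOfSpeech) {
        // Assumption: entries are keyed by the word with its tag appended.
        String key = word + partOfSpeech;
        String[] phones = mainList.get(key);
        if (phones == null) {
            phones = addenda.get(key);
        }
        if (phones == null) {
            // Last resort: generate a pronunciation from the spelling.
            phones = letterToSound.apply(word);
        }
        return phones;
    }
}
```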
 889 <P>The Lexicon data is represented in two forms: text and binary. The
 890 binary form loads much quicker than the text form of the data and is
 891 the form that is generally used by FreeTTS. FreeTTS provides a method
 892 of generating the binary form of the Lexicon from the text form of
 893 the Lexicon.
 894 </P>
 895 <H3>Letter-To-Sound Rules
 896 </H3>
 897 <P>The Letter-To-Sound (LTS) rules are used to generate a phone
 898 sequence for words not in the Lexicon. The LTS rules are a simple
 899 state machine, with one entry point for each letter of the alphabet.
 900 </P>
 901 <P>The state machine consists of a large list of entries. There are
 902 two types of entries: a STATE and a PHONE. A STATE entry contains a
 903 decision and the indices of two other entries. The first of these two
 904 indices represents where to go if the decision is true, and the
 905 second represents where to go if the decision is false. A PHONE entry
 906 is the final state of the decision tree and contains the phone that
 907 should be returned.
 908 </P>
 909 <P>The decision in FreeTTS's case is a simple character comparison,
 910 but it is done in the context of a window around the character in the
word. The decision consists of an index into the context window and a
 912 character value. If the character in the context window matches the
 913 character value, then the decision is true. The machine traversal for
 914 each letter starts at that letter's entry in the state machine and
 915 ends only when it reaches a final state. If there is no phone that
 916 can be mapped, the phone in the final state is set to 'epsilon.' The
 917 context window for a character is generated in the following way:
 918 </P>
 919 <UL>
 920         <LI><P STYLE="margin-bottom: 0cm">Pad the original word on either
 921         side with '#' and '0' characters to the size of the window for the
 922         LTS rules (in FreeTTS's case, this is 4). The &quot;#&quot; is used
 923         to indicate the beginning and end of the word. So, the word &quot;monkey&quot;
 924         would turn into &quot;000#monkey#000&quot;.
 925         </P>
        <LI><P>For each character in the word, the context window consists
        of the characters in the padded form that precede and follow that
        character; the character itself is not included. The number of
        characters on each side is the window size. So, for FreeTTS, the
        context window for the 'k' in monkey is &quot;#money#0&quot;.
 931         </P>
 932 </UL>
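<P>The padding and windowing steps above can be sketched as follows.
The class and method names are hypothetical; the window size of 4 and
the &quot;000#monkey#000&quot; padding come from the description above.
</P>

```java
/** Hypothetical sketch of the LTS context window construction. */
class LtsWindowSketch {
    static final int WINDOW_SIZE = 4;

    /** Pads "monkey" to "000#monkey#000" when WINDOW_SIZE is 4. */
    static String pad(String word) {
        StringBuilder zeros = new StringBuilder();
        for (int i = 0; i < WINDOW_SIZE - 1; i++) {
            zeros.append('0');
        }
        return zeros + "#" + word + "#" + zeros;
    }

    /**
     * Returns the WINDOW_SIZE characters before and after the letter at
     * the given index of the word, leaving out the letter itself.
     */
    static String contextWindow(String word, int index) {
        String padded = pad(word);
        int pos = index + WINDOW_SIZE;  // position of the letter in the padded form
        return padded.substring(pos - WINDOW_SIZE, pos)
            + padded.substring(pos + 1, pos + 1 + WINDOW_SIZE);
    }
}
```

<P>With this sketch, <FONT FACE="Courier, sans-serif">contextWindow(&quot;monkey&quot;, 3)</FONT>
returns &quot;#money#0&quot;, the window for 'k'.
</P>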
 933 <P>Here's how the phone for 'k' in 'monkey' might be determined:
 934 </P>
 935 <UL>
        <LI><P STYLE="margin-bottom: 0cm">Create the context window
        &quot;#money#0&quot; (the padded word around 'k', without the 'k' itself).
 938         </P>
 939         <LI><P STYLE="margin-bottom: 0cm">Start at the state machine entry
 940         for 'k' in the state machine.
 941         </P>
        <LI><P STYLE="margin-bottom: 0cm">Grab the 'index' from the current
        state. This represents an index into the context window. Compare the
        character at that index in the context window to the character
        stored in the current state. If there is a match, the next state is
        the true state. If there is not a match, the next state is the false
        state.
        </P>
        <LI><P STYLE="margin-bottom: 0cm">Repeat the previous step until you
        reach a final state.
        </P>
        <LI><P>When you get to the final state, the phone stored in that
        state is the result.
 954         </P>
 955 </UL>
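<P>The traversal above can be sketched with a flat array of entries.
Everything in this example is hypothetical: the class, the field names
and in particular the toy rules for 'k', which were invented for the
illustration and are not taken from the FreeTTS rule set.
</P>

```java
/** Hypothetical sketch of the LTS state machine traversal. */
class LtsMachineSketch {

    /** Either a STATE (a decision) or a PHONE (a final state). */
    static final class Entry {
        final boolean isFinal;
        final int index;      // STATE: position to inspect in the context window
        final char expected;  // STATE: character to compare against
        final int ifTrue;     // STATE: next entry when the characters match
        final int ifFalse;    // STATE: next entry when they do not match
        final String phone;   // PHONE: the result

        private Entry(boolean isFinal, int index, char expected,
                      int ifTrue, int ifFalse, String phone) {
            this.isFinal = isFinal;
            this.index = index;
            this.expected = expected;
            this.ifTrue = ifTrue;
            this.ifFalse = ifFalse;
            this.phone = phone;
        }

        static Entry state(int index, char expected, int ifTrue, int ifFalse) {
            return new Entry(false, index, expected, ifTrue, ifFalse, null);
        }

        static Entry phone(String phone) {
            return new Entry(true, 0, '\0', -1, -1, phone);
        }
    }

    /** Follows true/false links from the start entry to a final state. */
    static String phoneFor(Entry[] machine, int start, String window) {
        Entry entry = machine[start];
        while (!entry.isFinal) {
            boolean match = window.charAt(entry.index) == entry.expected;
            entry = machine[match ? entry.ifTrue : entry.ifFalse];
        }
        return entry.phone;
    }

    /** Invented rules for 'k'; entry 0 is the entry point for the letter. */
    static Entry[] toyMachineForK() {
        return new Entry[] {
            Entry.state(4, 'n', 1, 2),  // is the following letter 'n'?
            Entry.phone("epsilon"),     // yes: no phone is produced
            Entry.state(3, 'c', 3, 4),  // is the preceding letter 'c'?
            Entry.phone("epsilon"),
            Entry.phone("k")
        };
    }
}
```

<P>In a window of eight characters, index 3 is the letter just before
the current one and index 4 is the letter just after it.
</P>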
 956 <H3>Unit Selection
 957 </H3>
 958 <P>The designers of FreeTTS have written it in such a way that the
 959 unit selection can be done using several methods. The current methods
 960 are diphone and cluster unit selection.
 961 </P>
 962 <P>Luckily, the unit selection is independent of the wave synthesis.
 963 As a result, if the units from either unit selection type share the
 964 same format, the same wave synthesis technique can be used. This is
 965 the case for the KAL diphone and AWB cluster unit voices.
 966 </P>
 967 <H4>Diphone Unit Selection
 968 </H4>
<P>The diphone unit selection is very simple: it combines each
pair of adjacent phonemes into a name separated by a &quot;-&quot;.
These names are used to look up entries in the diphone database.
 972 </P>
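<P>A minimal sketch of this naming step is shown below. The class and
method names are hypothetical, and the phone list in the usage note is
only an example input.
</P>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of building diphone names from a phone sequence. */
class DiphoneNameSketch {

    static List<String> diphoneNames(List<String> phones) {
        List<String> names = new ArrayList<>();
        // Each phone is paired with the phone that follows it.
        for (int i = 0; i + 1 < phones.size(); i++) {
            names.add(phones.get(i) + "-" + phones.get(i + 1));
        }
        return names;
    }
}
```

<P>For the phones pau, hh, ax, l, ow, pau this yields the five names
pau-hh, hh-ax, ax-l, l-ow and ow-pau.
</P>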
 973 <H4>Cluster Unit Selection
 974 </H4>
 975 <P>The cluster unit selection is a bit more complex. Instead of
 976 working with diphones, it works on one unit at a time, and there can
 977 be more than one instance of a unit per database.
 978 </P>
 979 <P>The first step in cluster unit selection determines the unit type
 980 for each unit in the Utterance. The unit type for selection in the
 981 simple talking clock example (cmu_time_awb) is done per phone. The
 982 unit type consists of the phone name followed by the word the phone
 983 comes from (e.g., n_now for the phone 'n' in the word 'now').
 984 </P>
<P>The unit database contains multiple instances per unit type,
and they are indexed by number (e.g., n_now_0, n_now_1, etc.). The
database also records which unit instances come before and
after each unit (e.g., n_now_13 is preceded by z_is_13 and is
followed by aw_now_13).
 990 </P>
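<P>The naming scheme and the neighbor links can be sketched as follows.
The class, its fields and the instance names used in the test are
illustrative assumptions; the actual database layout is not shown in
this guide.
</P>

```java
/** Hypothetical sketch of cluster unit naming and neighbor links. */
class ClusterUnitSketch {

    /** Phone name followed by the word it comes from, e.g. "n_now". */
    static String unitType(String phone, String word) {
        return phone + "_" + word;
    }

    /** Unit type followed by its instance number, e.g. "n_now_13". */
    static String instanceName(String unitType, int index) {
        return unitType + "_" + index;
    }

    /** One recorded instance, with the names of its recorded neighbors. */
    static final class UnitInstance {
        final String name;
        final String previous;
        final String next;

        UnitInstance(String name, String previous, String next) {
            this.name = name;
            this.previous = previous;
            this.next = next;
        }

        /** True when the other instance was recorded right after this one. */
        boolean isFollowedBy(UnitInstance other) {
            return other.name.equals(next);
        }
    }
}
```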
 991 <P>Once the unit types have been determined, the next step is to
 992 select the best unit instance. This is done using a Viterbi algorithm
 993 where the cost is based upon the Mel-Cepstra distance between
 994 candidates. The candidate selection is determined using two things:
 995 </P>
 996 <UL>
 997         <LI><P STYLE="margin-bottom: 0cm">A CART - given an item, the CART
 998         will return a list of the unit type instances that are potential
 999         choices. Most of the CARTs in cmu_time_awb are very simple - there
1000         are no choices and the first node is a leaf node containing the
1001         list.
1002         </P>
1003         <LI><P>Extended selections. For the unit preceding the current unit,
1004         the candidate selection will search the units following that unit.
1005         If the unit type is the same as the current unit, then that unit is
1006         added as a candidate.
1007         </P>
1008 </UL>
1009 <P>After the candidates are chosen, the Viterbi algorithm is used to
1010 calculate path costs. The basic algorithm is as follows:
1011 </P>
1012 <UL>
1013         <LI><P STYLE="margin-bottom: 0cm">For each candidate for the current
1014         unit, calculate the cost between it and the first candidate in the
1015         next unit. Save only the path that has the least cost. By default,
1016         if two candidates come from units that are adjacent in the database,
1017         the cost is 0 (i.e., they were spoken together, so they are a
1018         perfect match).
1019         </P>
        <LI><P STYLE="margin-bottom: 0cm">Repeat the previous process for
        each candidate in the next unit, creating a list of least cost paths
        from the candidates of the current unit to the candidates of the unit
        following it.
1024         </P>
1025         <LI><P STYLE="margin-bottom: 0cm">Toss out all candidates in the
1026         current unit that are not included in a path.
1027         </P>
1028         <LI><P>Move to the next unit and repeat the process.
1029         </P>
1030 </UL>
<P>Once every unit has been processed, the path with the least total
cost is identified. The candidates on that path are the RELP-encoded
samples to take from the database.
1034 </P>
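<P>The least-cost path search can be sketched as a standard Viterbi
pass. This is a simplified illustration, not the FreeTTS implementation:
the class names are hypothetical, a single number per candidate stands
in for the Mel-Cepstra vectors, adjacency in the database is modeled as
consecutive index values, and every unit is assumed to have at least
one candidate.
</P>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical Viterbi sketch for choosing one candidate per unit. */
class ViterbiSketch {

    static final class Candidate {
        final String name;
        final int dbIndex;     // position of the instance in the database
        final double feature;  // scalar stand-in for the spectral features

        Candidate(String name, int dbIndex, double feature) {
            this.name = name;
            this.dbIndex = dbIndex;
            this.feature = feature;
        }
    }

    /** Cost of joining a to b; instances recorded back to back cost 0. */
    static double joinCost(Candidate a, Candidate b) {
        if (b.dbIndex == a.dbIndex + 1) {
            return 0.0;
        }
        return Math.abs(a.feature - b.feature);
    }

    static List<Candidate> bestPath(List<List<Candidate>> units) {
        int n = units.size();
        double[][] cost = new double[n][];  // least cost of a path ending here
        int[][] back = new int[n][];        // best predecessor for that path
        for (int t = 0; t < n; t++) {
            List<Candidate> current = units.get(t);
            cost[t] = new double[current.size()];
            back[t] = new int[current.size()];
            for (int j = 0; j < current.size(); j++) {
                back[t][j] = -1;
                if (t == 0) {
                    continue;  // the first unit has no join cost
                }
                List<Candidate> previous = units.get(t - 1);
                double best = Double.POSITIVE_INFINITY;
                for (int i = 0; i < previous.size(); i++) {
                    double c = cost[t - 1][i]
                        + joinCost(previous.get(i), current.get(j));
                    if (c < best) {
                        best = c;
                        back[t][j] = i;
                    }
                }
                cost[t][j] = best;
            }
        }
        // Pick the cheapest end point, then walk the back pointers.
        int j = 0;
        for (int k = 1; k < cost[n - 1].length; k++) {
            if (cost[n - 1][k] < cost[n - 1][j]) {
                j = k;
            }
        }
        List<Candidate> path = new ArrayList<>();
        for (int t = n - 1; t >= 0; t--) {
            path.add(units.get(t).get(j));
            j = back[t][j];
        }
        Collections.reverse(path);
        return path;
    }
}
```

<P>Keeping only the best predecessor for each candidate is what lets
the search discard every path that cannot be part of the final result.
</P>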
1035 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
1036         <TR>
1037                 <TD BGCOLOR="#eeeeff">
1038                         <H2><A NAME="code"></A>CMUDiphoneVoice Code Walkthrough
1039                         </H2>
1040                 </TD>
1041         </TR>
1042 </TABLE>
1043 <P>In this section, we will look at the CMUDiphoneVoice class to see
1044 how a new voice is created and customized.
1045 </P>
1046 <PRE>
1047 /**
1048  * Defines an unlimited-domain diphone synthesis based voice
1049  */</PRE><P>
<FONT COLOR="#008000">The CMUDiphoneVoice class extends CMUVoice.
CMUVoice provides much of the standard voice definition, including
loading the Lexicon, setting up common features, and setting up the
UtteranceProcessors. </FONT>
1054 </P>
1055 <PRE>public class CMUDiphoneVoice extends CMUVoice {
1056
1057     /**
1058      * Creates a simple voice
1059      */
1060 </PRE><P>
<FONT COLOR="#008000">It is possible and quite likely that multiple
voices will want to share a single Lexicon. By passing false to the
CMUVoice constructor, this voice indicates that by default no Lexicon
should be created. This allows a Voice manager (such as the
FreeTTSSynthesizer) to create a single Lexicon and have multiple voices
share it. </FONT>
1067 </P>
1068 <PRE>    public CMUDiphoneVoice() {
1069         this(false);
1070     }
1071
1072     /**
1073      * Creates a simple voice
1074      *
1075      * @param createLexicon if true automatically load up
1076      * the default CMU lexicon; otherwise, don't load it.
1077      */</PRE><P>
1078 <FONT COLOR="#008000">This version of the constructor sets the
1079 standard rate, pitch and range values for the voice. </FONT>
1080 </P>
1081 <PRE>    public CMUDiphoneVoice(boolean createLexicon) {
1082         super(createLexicon);
1083         setRate(150f);
1084         setPitch(100F);
1085         setPitchRange(11F);
1086     }
1087
1088     /**
1089      * Sets the FeatureSet for this Voice.
1090      *
1091      * @throws IOException if an I/O error occurs
1092      */</PRE><P>
1093 If this voice needed to add or customize the feature set, it would do
1094 so here. This voice is happy with the default features provided by
1095 CMUVoice, so nothing is done here.
1096 </P>
<PRE>    protected void setupFeatureSet() throws IOException {
        super.setupFeatureSet();
    }

    /**
     * Returns the post lexical processor to be used by this voice.
     * Derived voices typically override this to customize behaviors.
     *
     * @return the post lexical processor
     *
     * @throws IOException if an IO error occurs while getting
     *     processor
     */</PRE><P>
1110 <FONT COLOR="#008000">Here is an example of how to override the
1111 default Utterance processing provided by CMUVoice. CMUDiphoneVoice
1112 needs to provide a post-lexical analyzer that converts one phone &quot;ah&quot;
1113 to another &quot;aa&quot;. CMUVoice provides a number of 'getXXXX'
1114 functions that return the UtteranceProcessor that will be used for
1115 that stage of processing. CMUDiphoneVoice overrides the
1116 getPostLexicalAnalyzer method to provide the customized post lexical
1117 analyzer. </FONT>
1118 </P>
<PRE>    protected UtteranceProcessor getPostLexicalAnalyzer() throws IOException {
        return new CMUDiphoneVoicePostLexicalAnalyzer();
    }

    /**
     * Returns the pitch mark generator to be used by this voice.
     * Derived voices typically override this to customize behaviors.
     * This voice uses a DiphonePitchmarkGenerator to generate
     * pitchmarks.
     *
     * @return the pitch mark generator
     *
     * @throws IOException if an IO error occurs while getting
     *     processor
     */
</PRE><P>
1135 <FONT COLOR="#008000">The diphone voice needs to provide a customized
1136 pitchmark generator that is specific to diphone synthesis.</FONT>
1137 </P>
<PRE>    protected UtteranceProcessor getPitchmarkGenerator() throws IOException {
        return new DiphonePitchmarkGenerator();
    }

    /**
     * Returns the unit concatenator to be used by this voice.
     * Derived voices typically override this to customize behaviors.
     * This voice uses a relp.UnitConcatenator to concatenate units.
     *
     * @return the unit concatenator
     *
     * @throws IOException if an IO error occurs while getting
     *     processor
     */
</PRE><P>
1153 <FONT COLOR="#008000">This voice uses the standard UnitConcatenator. </FONT>
1154 </P>
<PRE>    protected UtteranceProcessor getUnitConcatenator() throws IOException {
        return new UnitConcatenator();
    }


    /**
     * Returns the unit selector to be used by this voice.
     * Derived voices typically override this to customize behaviors.
     * This voice uses the DiphoneUnitSelector to select units. The
     * unit selector requires the name of a diphone database. If no
     * diphone database has been specified (by setting the
     * DATABASE_NAME feature of this voice) then by default
     * cmu_kal/diphone_units.bin is used.
     *
     * @return the unit selector
     *
     * @throws IOException if an IO error occurs while getting
     *     processor
     */</PRE><P>
1174 <FONT COLOR="#008000">The unit selector is also diphone specific.
1175 Note that this method also specifies which diphone unit database to
1176 use if none has been supplied already. </FONT>
1177 </P>
1178 <PRE>    protected UtteranceProcessor getUnitSelector() throws IOException {
1179         String unitDatabaseName = getFeatures().getString(DATABASE_NAME);
1180
1181         if (unitDatabaseName == null) {
1182             unitDatabaseName = &quot;cmu_kal/diphone_units.bin&quot;;
1183         }
1184
1185         return new DiphoneUnitSelector(
1186             this.getClass().getResource(unitDatabaseName));
1187     }
1188
1189
1190     /**
1191      * Converts this object to a string
1192      *
1193      * @return a string representation of this object
1194      */
1195     public String toString() {
1196         return &quot;CMUDiphoneVoice&quot;;
1197     }
1198 }
1199
1200
1201 /**
1202  * Annotates the Utterance with post lexical information. Converts AH
1203  * phonemes to AA phoneme in addition to the standard english postlex
1204  * processing.
1205  */
1206 </PRE><P>
1207 <FONT COLOR="#008000">Here is an example of defining a new
1208 UtteranceProcessor. This UtteranceProcessor traverses through the
1209 SEGMENT Relation looking for all phones of type &quot;ah&quot; and
1210 converts them to &quot;aa&quot; phones. Since this Processor is used
1211 to replace the default post-lexical analyzer processor, it invokes
1212 the default post-lexical analyzer after performing the custom
1213 processing. </FONT>
1214 </P>
1215 <PRE>class CMUDiphoneVoicePostLexicalAnalyzer implements UtteranceProcessor {
1216     UtteranceProcessor englishPostLex =
1217         new com.sun.speech.freetts.en.PostLexicalAnalyzer();
1218
1219     /**
1220      * performs the processing
1221      * @param  utterance  the utterance to process/tokenize
1222      * @throws ProcessException if an IOException is thrown during the
1223      *         processing of the utterance
1224      */
1225     public void processUtterance(Utterance utterance) throws ProcessException {
1226         fixPhoneme_AH(utterance);
1227         englishPostLex.processUtterance(utterance);
1228     }
1229
1230
1231     /**
1232      * Turns all AH phonemes into AA phonemes.
1233      * This should really be done in the index itself
1234      * @param utterance the utterance to fix
1235      */
1236     private void fixPhoneme_AH(Utterance utterance) {
1237         for (Item item = utterance.getRelation(Relation.SEGMENT).getHead();
1238                 item != null;
1239                 item = item.getNext()) {
1240             if (item.getFeatures().getString(&quot;name&quot;).equals(&quot;ah&quot;)) {
1241                 item.getFeatures().setString(&quot;name&quot;, &quot;aa&quot;);
1242             }
1243         }
1244     }
1245
1246     // inherited from Object
1247     public String toString() {
1248         return &quot;PostLexicalAnalyzer&quot;;
1249     }
1250 }</PRE>
1251 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
1252         <TR>
1253                 <TD BGCOLOR="#eeeeff">
1254                         <H2><A NAME="packaging"></A>Voice Packaging
1255                         </H2>
1256                 </TD>
1257         </TR>
1258 </TABLE>
1259 <P> FreeTTS has been designed to allow flexible and dynamic addition of voices.
1260 </P>
1261 <H3>Voice Packages</H3>
1262 <P>
Voices are defined by their corresponding VoiceDirectories.  These directories
are what actually create the instances of the voices, and a single directory
can create several different voices.  This is useful when a single voice class
can sound dramatically different depending on the parameters used to create
it: the directory can then return more than one instance of the same voice
class, each with a different sound.  It may also be useful for the same voice
package to contain more than one voice, providing a single interface to those
voices.  The voice directory MUST also provide a main() function that prints
out information about the voice if invoked.  Typically this is done by
simply calling the VoiceDirectory's toString() method.
1273 </P>
1274 <P> A voice package is a jar file that contains exactly one subclass of
1275 VoiceDirectory.  The package probably contains data files as well as other
1276 java classes that implement the voices provided.  The jarfile Manifest
1277 must also include three entries:</P>
1278 <OL>
1279     <LI><P>&quot;Main-Class&quot; which will be the VoiceDirectory class,
1280         and prints out information about the voices provided</P>
1281     <LI><P>&quot;FreeTTSVoiceDefinition: true&quot;, which informs
1282         FreeTTS that this jarfile is a voice package</P>
1283     <LI><P>&quot;Class-Path:&quot; which lists all the jars upon which
1284         this voice package is dependent.  For example, the voice may
1285         be dependent upon its lexicon jarfile.  This allows a user
1286         to simply execute the main() function without having to specify
1287         all of the dependencies (which the user may not know).</P>
1288 </OL>
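<P>The three Manifest entries can be illustrated with the standard
<FONT FACE="Courier, sans-serif">java.util.jar.Manifest</FONT> class.
The sketch below is not how FreeTTS itself detects voice packages; the
class name, the example directory class and the example jar name are
hypothetical.
</P>

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical check of the Manifest entries listed above. */
class VoiceManifestSketch {

    /** Example Manifest text; the class and jar names are made up. */
    static final String EXAMPLE =
        "Manifest-Version: 1.0\n"
        + "Main-Class: com.example.voices.ExampleVoiceDirectory\n"
        + "FreeTTSVoiceDefinition: true\n"
        + "Class-Path: examplelexicon.jar\n";

    static Manifest parse(String text) {
        try {
            return new Manifest(new ByteArrayInputStream(
                text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when the Manifest marks the jar as a voice package. */
    static boolean isVoicePackage(Manifest manifest) {
        Attributes main = manifest.getMainAttributes();
        return "true".equalsIgnoreCase(main.getValue("FreeTTSVoiceDefinition"))
            && main.getValue(Attributes.Name.MAIN_CLASS) != null;
    }

    /** The Main-Class entry, which names the VoiceDirectory class. */
    static String voiceDirectoryClass(Manifest manifest) {
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }
}
```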
1289 <H3>Installing Voice Packages</H3>
1290 <P>Voice Packages can be added to FreeTTS without any compilation.  There
1291 are two ways to alert FreeTTS to the presence of a new voice:</P>
1292 <OL>
    <LI><P>Listing the VoiceDirectory classes that should be loaded.</P>
1294     <LI><P>Putting the packages in the correct directory and allowing
1295             FreeTTS to automatically detect them.
1296             [[[TODO: This is not yet implemented.  For now use the listing method.]]]
1297             </P>
1298 </OL>
<P>Listing the VoiceDirectory classes requires all of the required classes
to be on the Java classpath.  The names of the voice directory
classes are listed in voices files.  When VoiceManager.getVoices() is
called, FreeTTS reads several files.</P>
1303 <OL>
    <LI><P>First, it looks for internal_voices.txt, stored
    in the same directory as VoiceManager.class (if the VoiceManager is in a
    jarfile, which it probably is, then this file is also inside the jarfile).
    If the file does not exist, FreeTTS moves on.  internal_voices.txt exists
    only to allow one to package FreeTTS into a single stand-alone jarfile,
    as may be needed for applets.  Avoid using internal_voices.txt if
    at all possible: it requires you to ship all listed voices
    along with FreeTTS and provides little flexibility.</P>
1312
1313     <LI><P>Next, FreeTTS looks for voices.txt in the same directory as
1314     freetts.jar (assuming FreeTTS is being executed from a jar, which
1315     it probably is).  If the file does not exist, FreeTTS moves on.</P>
1316
1317     <LI><P>Last, if the system property &quot;freetts.voicesfile&quot;
1318     is defined, then FreeTTS will use the voice directory classes
1319     listed in that file.</P>
1320 </OL>
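<P>The second and third lookup steps above can be sketched as follows.  This is
an illustration only, not the VoiceManager implementation.  It assumes that a
voices file holds one class name per line and that blank lines and lines
starting with &quot;#&quot; can be ignored; this guide does not specify the file
format.  The class VoicesFileSketch and its method names are hypothetical.
internal_voices.txt is left out because it is read from inside the jarfile.</P>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of FreeTTS.
class VoicesFileSketch {

    /** Extracts class names, skipping blank lines and '#' comments (assumed format). */
    static List<String> parse(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** Collects class names from voices.txt beside the jar, then from freetts.voicesfile. */
    static List<String> collect(Path jarDirectory) throws IOException {
        List<String> names = new ArrayList<>();

        // voices.txt in the same directory as freetts.jar; skipped if absent.
        Path besideJar = jarDirectory.resolve("voices.txt");
        if (Files.isRegularFile(besideJar)) {
            names.addAll(parse(Files.readAllLines(besideJar)));
        }

        // The file named by the system property, if the property is defined.
        String extra = System.getProperty("freetts.voicesfile");
        if (extra != null) {
            names.addAll(parse(Files.readAllLines(Paths.get(extra))));
        }
        return names;
    }
}
```

<P>A system property such as freetts.voicesfile is normally supplied on the
command line with the JVM's -D option, or set with System.setProperty() before
the voices are first requested.</P>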
1321
1322 <P>Voice packages can also be recognized simply by putting them in
1323 the correct filesystem directory.
1324 [[[TODO: At least, that is the plan.  This is not yet actually implemented.]]]
1325 If a jarfile is in the correct directory
1326 and has the &quot;FreeTTSVoiceDefinition: true&quot; definition
1327 in its Manifest, then it is assumed to be a voice package.  The file
1328 is then loaded along with all dependencies listed in the
1329 &quot;Class-Path:&quot; definition.  Whatever class is listed as the
1330 &quot;Main-Class:&quot; is assumed to be the voice directory.  There
1331 are two ways to specify which filesystem directory to look in:</P>
1332 <OL>
    <LI><P>By default, FreeTTS will look in the same directory as
        freetts.jar (assuming FreeTTS was loaded from a jarfile, which
        it probably was).</P>
1336     <LI><P>The directories specified by the system property
1337         &quot;freetts.voicespath&quot;.</P>
1338 </OL>
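<P>Since automatic detection is described above as not yet implemented, the
following is only a sketch of how such detection could work with the standard
java.util.jar API; it is not FreeTTS code.  The class VoiceJarScanSketch and its
methods are hypothetical.  The sketch also assumes that
&quot;freetts.voicespath&quot; uses the platform path separator and is searched
in addition to the default directory; this guide confirms neither.</P>

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper for illustration only; not part of FreeTTS.
class VoiceJarScanSketch {

    /** Returns the Main-Class of every jar in dir that is marked as a voice package. */
    static List<String> findVoiceDirectories(File dir) throws IOException {
        List<String> found = new ArrayList<>();
        File[] jars = dir.listFiles((d, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return found; // not a directory, or not readable
        }
        for (File jar : jars) {
            try (JarFile jarFile = new JarFile(jar)) {
                Manifest manifest = jarFile.getManifest();
                if (manifest == null) {
                    continue; // jar without a Manifest cannot be a voice package
                }
                Attributes main = manifest.getMainAttributes();
                if ("true".equals(main.getValue("FreeTTSVoiceDefinition"))) {
                    // The Main-Class is assumed to be the voice directory.
                    String mainClass = main.getValue(Attributes.Name.MAIN_CLASS);
                    if (mainClass != null) {
                        found.add(mainClass);
                    }
                }
            }
        }
        return found;
    }

    /** Scans the default directory, then each directory in freetts.voicespath. */
    static List<String> scan(File defaultDir) throws IOException {
        List<String> found = new ArrayList<>(findVoiceDirectories(defaultDir));
        String path = System.getProperty("freetts.voicespath");
        if (path != null) {
            // Assumption: entries are separated like a classpath.
            for (String entry : path.split(File.pathSeparator)) {
                if (!entry.isEmpty()) {
                    found.addAll(findVoiceDirectories(new File(entry)));
                }
            }
        }
        return found;
    }
}
```

<P>A real loader would additionally have to load each jar together with the
jars named in its &quot;Class-Path:&quot; entry before instantiating the
directory class; that step is omitted here.</P>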
1339
1340 <H3>Compiling Voice Packages</H3>
<P>To create a voice package, you need only meet the requirements
above.  However, that can be a bit of work.  If you want to import a voice
from FestVox, there are tools in tools/FestVoxToFreeTTS/.  See the
README file there for more information.  The scripts can automatically
import a US/English voice, but are not designed to handle others.  For the
1346 simple case of US/English voices, simply put them in a subdirectory of
1347 com/sun/speech/freetts/en/us/.  Files ending with &quot;.txt&quot; will
1348 be assumed to be data files for the voice and compiled into their
1349 &quot;.bin&quot; and &quot;.idx&quot; equivalents.  The file
1350 &quot;voice.Manifest&quot; will automatically be added to the Manifest
1351 of the voice package's jarfile.  The compilation system will automatically
1352 detect new directories inside en/us, assume they are voice packages,
1353 and create new jarfiles for them.</P>
1354
1355 <HR>
1356 <P STYLE="margin-bottom: 0cm">See the <A HREF="../license.terms">license
1357 terms</A> and <A HREF="../acknowledgments.txt">acknowledgments</A>.<BR>Copyright
1358 2003 Sun Microsystems, Inc. All Rights Reserved. Use is subject to
1359 license terms.
1360 </P>
1361 </BODY>
1362 </HTML>