<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
<TITLE>FreeTTS Programmer's Guide</TITLE>
<META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Solaris Sparc)">
<META NAME="CREATED" CONTENT="20011115;9153400">
<META NAME="CHANGED" CONTENT="20011116;15171800">
<!--
 * Copyright (c) 2001 Sun Microsystems, Inc.
 * See the file "license.terms" for information on usage and
 * redistribution of this file, and for a DISCLAIMER OF ALL WARRANTIES.
-->
</HEAD>
<BODY BGCOLOR="#ffffff">
<TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2 BGCOLOR="#ccccee" STYLE="page-break-before: always">
<TR><TD>
<H1 ALIGN=CENTER>FreeTTS Programmer's Guide</H1>
</TD></TR>
</TABLE>
<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=2>
<TR><TD WIDTH=25% VALIGN=TOP BGCOLOR="#eeeeff">
<P><BR><B>Table of Contents</B><BR>
<A HREF="#organization">FreeTTS Organization</A><BR>
<A HREF="#objects">Major FreeTTS Objects</A><BR>
<A HREF="#processing">Processing Walkthrough</A><BR>
<A HREF="#data">FreeTTS Data</A><BR>
<A HREF="#code">Code Walkthrough</A><BR>
<A HREF="#packaging">Voice Packaging</A></P>
<P><B>Related Documentation</B><BR>
<A HREF="index.html">FreeTTS Overview</A><BR>
<A HREF="../javadoc/index.html">FreeTTS API</A><BR>
<A HREF="http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-doc/index.html">The Java Speech API (JSAPI)</A><BR>
<A HREF="http://java.sun.com/products/java-media/speech/forDevelopers/jsapi-guide/index.html">JSAPI Programmer's Guide</A><BR>
<A HREF="http://www.cmuflite.org/">Flite</A><BR>
<A HREF="http://www.speech.cs.cmu.edu/festival/index.html">Festival</A></P>
</TD></TR>
</TABLE>
<P><I>What this is</I> - This is an overview of how FreeTTS works
from a programmer's point of view. It describes the major classes
and objects used in FreeTTS, provides a data-flow walkthrough of
FreeTTS as it synthesizes speech, and provides an annotated
definition of a voice that serves as an example of how to define
your own voice.</P>
<P><I>What this is not</I> - This is not an API guide to FreeTTS,
nor is it a tutorial on the fundamentals of speech synthesis.</P>
<P>The FreeTTS package is based upon Flite, a lightweight
synthesis package developed at CMU. FreeTTS retains the core
architecture of Flite, so anyone who is familiar with the workings of
Flite will probably feel comfortable working with FreeTTS. Since
Flite itself is based upon the Festival speech synthesis system,
those who are experienced with the Festival package will notice
some similarities between Festival and FreeTTS.</P>
<TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
<TR><TD BGCOLOR="#eeeeff">
<H2><A NAME="organization"></A>FreeTTS Organization</H2>
</TD></TR>
</TABLE>
<P>FreeTTS is organized as a number of source trees, as follows:</P>
<UL>
<LI><P STYLE="margin-bottom: 0cm"><B>javax.speech</B> contains the
generic JSAPI speech implementation. Code in this tree is
independent of any speech synthesis system.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.engine</B> contains
support for JSAPI 1.0. Various packages
can be found in this tree.</P>
<LI><P><B>com.sun.speech.freetts</B> contains the implementation of
the FreeTTS synthesis engine. The bulk of the code can be found in
this tree; the <B>com.sun.speech.freetts.jsapi</B> package provides the
JSAPI glue code for FreeTTS.</P>
</UL>
<P>The <B>com.sun.speech.freetts</B> package is broken down further
into a set of sub-packages, as follows:</P>
<UL>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts</B> contains
the high-level interfaces and classes for FreeTTS. Much of the code here
is independent of any particular language or voice.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.diphone</B>
provides support for diphone-encoded speech.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.clunits</B>
provides support for cluster-unit-encoded speech.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.lexicon</B>
provides the definition and implementation of the Lexicon and
Letter-to-Sound rules.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.util</B>
provides a set of tools and utilities.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.audio</B>
provides audio output support.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.cart</B>
provides the interface to and implementations of several
Classification and Regression Trees (CARTs).</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.relp</B>
provides support for Residual Excited Linear Predictive (RELP)
decoding of audio samples.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>com.sun.speech.freetts.en</B>
contains English-specific code.</P>
<LI><P><B>com.sun.speech.freetts.en.us</B> contains
US-English-specific code.</P>
</UL>
<TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
<TR><TD BGCOLOR="#eeeeff">
<H2><A NAME="objects"></A>Major FreeTTS Objects</H2>
</TD></TR>
</TABLE>
<P>A number of objects work together to perform speech synthesis.
The major objects are described below.</P>
<H3><A HREF="../javadoc/com/sun/speech/freetts/FreeTTSSpeakable.html">FreeTTSSpeakable</A></H3>
<P>FreeTTSSpeakable is an interface. Anything that is a source of text
that needs to be spoken with FreeTTS is first converted into a
FreeTTSSpeakable. One implementation of this interface is
FreeTTSSpeakableImpl, which wraps the most common
input forms (a String, an InputStream, or a JSML XML document) as a
FreeTTSSpeakable. A FreeTTSSpeakable is given to a Voice to be spoken.</P>
<H3><A HREF="../javadoc/com/sun/speech/freetts/Voice.html">Voice</A></H3>
<P>The Voice is the central processing point for FreeTTS. The Voice
takes a FreeTTSSpeakable as input, translates the text associated with
the FreeTTSSpeakable into speech, and generates audio output
corresponding to that speech. The Voice is also the primary customization
point for FreeTTS: language, speaker, and algorithm customizations can
all be performed by extending the Voice. A Voice accepts a
FreeTTSSpeakable via the Voice.speak method and processes it as follows:</P>
<OL>
<LI><P STYLE="margin-bottom: 0cm">The Voice converts the
FreeTTSSpeakable into a series of Utterances. The rules for breaking
a FreeTTSSpeakable into Utterances are generally language-dependent.
For instance, an English Voice may choose to break a FreeTTSSpeakable
into Utterances based upon sentence breaks.</P>
<LI><P STYLE="margin-bottom: 0cm">As the Voice generates each
Utterance, a series of UtteranceProcessors processes the Utterance.
Each Voice defines its own set of UtteranceProcessors; this is the
primary method of customizing Voice behavior. For instance, to
change how units are joined together during the synthesis process, a
Voice would simply supply a new UtteranceProcessor that implements
the new algorithm. Typically, each UtteranceProcessor runs in
turn, annotating or modifying the Utterance with information. For
instance, a 'Phrasing' UtteranceProcessor may insert phrase marks
into an Utterance that indicate where a spoken phrase begins. The
Utterance and UtteranceProcessors are described in more detail
below.</P>
<LI><P>Once all Utterance processing has been applied, the Voice
sends the Utterance to the AudioOutput UtteranceProcessor. The
AudioOutput processor may run in a separate thread to allow
Utterance processing to overlap with audio output, ensuring the
lowest possible sound latency.</P>
</OL>
<H3><A HREF="../javadoc/com/sun/speech/freetts/VoiceManager.html">VoiceManager</A></H3>
<P>The VoiceManager is the central repository of the voices available to
FreeTTS. To get a voice you can do:</P>
<PRE>
VoiceManager voiceManager = VoiceManager.getInstance();

// create a list of new Voice instances
Voice[] voices = voiceManager.getVoices();

// iterate through the list until you find a Voice with the
// properties you want (here, simply by name)
int x;
for (x = 0; x &lt; voices.length; x++) {
    if (voices[x].getName().equals("kevin16")) {
        break;
    }
}

// allocate the resources for the voice
voices[x].allocate();
</PRE>
<P>You can save yourself the chore of iterating through the voices if you
already know the name of the Voice you want by using
voiceManager.getVoice() instead.</P>
<H3><A HREF="../javadoc/com/sun/speech/freetts/Utterance.html">Utterance</A></H3>
<P>The Utterance is the central processing target in FreeTTS. A
FreeTTSSpeakable is broken up into one or more Utterances, processed
by a series of UtteranceProcessors, and finally output as audio. An
Utterance consists of a set of Relations and a set of features called
a FeatureSet.</P>
<H3><A HREF="../javadoc/com/sun/speech/freetts/FeatureSet.html">FeatureSet</A></H3>
<P>A FeatureSet is simply a collection of name/value pairs. An Utterance can contain
an arbitrary number of FeatureSets. FeatureSets are typically used to
maintain global Utterance information such as volume, pitch, and
speaking rate.</P>
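<P>The idea can be sketched with a simple map-backed class. The class below is an
illustrative stand-in, not the actual com.sun.speech.freetts.FeatureSet
implementation; its method names are only loosely modeled on the real interface:</P>

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a FeatureSet: a bag of name/value pairs.
public class FeatureSetSketch {
    private final Map<String, Object> features = new HashMap<>();

    public void setFloat(String name, float value) { features.put(name, value); }
    public float getFloat(String name) { return (Float) features.get(name); }
    public void setString(String name, String value) { features.put(name, value); }
    public String getString(String name) { return (String) features.get(name); }
    public boolean isPresent(String name) { return features.containsKey(name); }

    public static void main(String[] args) {
        FeatureSetSketch utteranceFeatures = new FeatureSetSketch();
        utteranceFeatures.setFloat("pitch", 100.0f);   // global pitch, e.g. in Hz
        utteranceFeatures.setFloat("volume", 0.8f);
        System.out.println(utteranceFeatures.getFloat("pitch"));
    }
}
```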
<H3><A HREF="../javadoc/com/sun/speech/freetts/Relation.html">Relation</A></H3>
<P>A Relation is a named list of Items. An Utterance can hold an
arbitrary number of Relations. A typical UtteranceProcessor may
iterate through one Relation and create a new Relation. For
instance, a word normalization UtteranceProcessor could iterate
through a token Relation and generate a word Relation based upon
token-to-word rules. A detailed description of Utterance
processing and how it affects the Relations in an Utterance is
given below.</P>
<H3><A HREF="../javadoc/com/sun/speech/freetts/Item.html">Item</A></H3>
<P>A Relation is a list of Item objects. An Item contains a set of
Features (as described previously, simply name/value
pairs). An Item can have a list of daughter Items as well. Items in a
Relation are linked to Items in the same and other Relations. For
instance, the words in a word Relation are linked back to the
corresponding tokens in the token Relation. Similarly, a word in a
word Relation is linked to the previous and next words in the word
Relation. This gives an UtteranceProcessor the capability of easily
traversing from one Item to another.</P>
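<P>The linkage described above can be sketched in a few lines of Java. The classes
below are toy stand-ins for the FreeTTS Relation and Item classes, showing only
the neighbour links within a Relation and the cross-Relation link from a word
back to its token:</P>

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of Relation/Item linkage.
public class RelationSketch {
    static class Item {
        final String value;
        Item prev, next;     // neighbours within the same relation
        Item linkedTo;       // e.g. a word linked back to its source token
        Item(String value) { this.value = value; }
    }

    static class Relation {
        final String name;
        final List<Item> items = new ArrayList<>();
        Relation(String name) { this.name = name; }
        Item append(String value) {
            Item item = new Item(value);
            if (!items.isEmpty()) {
                Item last = items.get(items.size() - 1);
                last.next = item;
                item.prev = last;
            }
            items.add(item);
            return item;
        }
    }

    public static void main(String[] args) {
        Relation tokens = new Relation("Token");
        Relation words = new Relation("Word");
        Item token = tokens.append("2001");
        // one token may expand to several words, each linked back to it
        for (String w : new String[] {"two", "thousand", "one"}) {
            words.append(w).linkedTo = token;
        }
        // walk the word relation via the next links
        for (Item i = words.items.get(0); i != null; i = i.next) {
            System.out.println(i.value + " <- " + i.linkedTo.value);
        }
    }
}
```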
<H3><A HREF="../javadoc/com/sun/speech/freetts/UtteranceProcessor.html">UtteranceProcessor</A></H3>
<P STYLE="margin-bottom: 0cm">An UtteranceProcessor is any object
that implements the UtteranceProcessor interface. An
UtteranceProcessor takes an Utterance as input and performs some
operation on it.</P>
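<P>As a sketch, a custom processor implements a single processing method and
annotates the Utterance it is handed. In the real API the method is
processUtterance and may throw a ProcessException; the minimal types below are
illustrative stand-ins only:</P>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the UtteranceProcessor idea.
public class ProcessorSketch {
    static class Utterance {
        final Map<String, String> features = new LinkedHashMap<>();
    }

    interface UtteranceProcessor {
        void processUtterance(Utterance u);
    }

    // A toy "phrasing" processor that stamps a feature on the utterance.
    static class Phraser implements UtteranceProcessor {
        public void processUtterance(Utterance u) {
            u.features.put("phrased", "true");
        }
    }

    public static void main(String[] args) {
        Utterance u = new Utterance();
        UtteranceProcessor[] pipeline = { new Phraser() };
        for (UtteranceProcessor p : pipeline) {
            p.processUtterance(u);   // each processor runs in turn
        }
        System.out.println(u.features);
    }
}
```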
<P STYLE="margin-bottom: 0cm"><BR></P>
<TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
<TR><TD BGCOLOR="#eeeeff">
<H2><A NAME="processing"></A>Processing Walkthrough</H2>
</TD></TR>
</TABLE>
<P>In this section we describe the detailed processing performed
by the CMUDiphoneVoice. This voice is an unlimited-domain voice that
uses diphone synthesis to generate speech. It is derived from the
CMUVoice class, which describes the general processing
required for an English voice without specifying how unit selection
and concatenation are performed. Subclasses of CMUVoice
(CMUDiphoneVoice and CMUClusterUnitVoice) provide this
specialization.</P>
<P>Processing starts with the <FONT FACE="Courier, sans-serif">speak</FONT>
method found in <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.Voice</FONT>.
The <FONT FACE="Courier, sans-serif">speak</FONT> method performs the
following steps:</P>
<UL>
<li><a href="#Tokenization">Tokenization</a>
<li><a href="#TokenToWords">TokenToWords</a>
<li><a href="#PartOfSpeechTagger">PartOfSpeechTagger</a>
<li><a href="#Phraser">Phraser</a>
<li><a href="#Segmenter">Segmenter</a>
<li><a href="#PauseGenerator">PauseGenerator</a>
<li><a href="#Intonator">Intonator</a>
<li><a href="#PostLexicalAnalyzer">PostLexicalAnalyzer</a>
<li><a href="#Durator">Durator</a>
<li><a href="#ContourGenerator">ContourGenerator</a>
<li><a href="#UnitSelector">UnitSelector</a>
<li><a href="#PitchMarkGenerator">PitchMarkGenerator</a>
<li><a href="#UnitConcatenator">UnitConcatenator</a>
</UL>
<H3><a name="Tokenization">Tokenization</a></H3>
<P>In this step, the Voice uses the Tokenizer returned by its
<FONT FACE="Courier, sans-serif">getTokenizer</FONT> method to break
a FreeTTSSpeakable object into a series of Utterances. Typically,
tokenization is language-specific, so each Voice needs to specify
which Tokenizer is to be used by overriding the <FONT FACE="Courier, sans-serif">getTokenizer</FONT>
method. The CMUDiphoneVoice uses the
<FONT FACE="Courier, sans-serif">com.sun.speech.freetts.en.TokenizerImpl</FONT>
Tokenizer, which is designed to parse and tokenize the English
language.</P>
<P>A Tokenizer breaks an input stream of text into a series of Tokens,
defined by the <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.Token</FONT>
class. Typically, a Token represents a single word in the
input stream. Additionally, a Token includes such information as
the surrounding punctuation and whitespace, and the position of the
token in the input stream.</P>
<P>The English Tokenizer (<FONT FACE="Courier, sans-serif">com.sun.speech.freetts.en.TokenizerImpl</FONT>)
relies on a set of symbols being defined that specify which characters
are to be considered whitespace and punctuation.</P>
<P>The Tokenizer defines a method called <FONT FACE="Courier, sans-serif">isBreak</FONT>
that is used to determine when the input stream should be broken and
a new Utterance generated. For example, the English Tokenizer has
a set of rules to detect the end of a sentence. If the current token
should start a new sentence, the English Tokenizer's <FONT FACE="Courier, sans-serif">isBreak</FONT>
method returns true.</P>
<P>A higher-level Tokenizer, FreeTTSSpeakableTokenizer, repeatedly
calls the English Tokenizer and places each token into a list. When
the Tokenizer's <FONT FACE="Courier, sans-serif">isBreak</FONT> method
indicates that a sentence break has occurred, the Voice creates a new
Utterance with the current list of tokens. The process of generating
and processing Utterances continues until no more tokens remain in
the input.</P>
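<P>The loop described above can be sketched as follows. The break test here (a
token ending in sentence punctuation) is a deliberate simplification of the real
isBreak rules:</P>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of utterance segmentation: accumulate tokens and start a new
// utterance whenever the break test fires.
public class BreakSketch {
    static boolean isBreak(String token) {
        return token.endsWith(".") || token.endsWith("!") || token.endsWith("?");
    }

    static List<List<String>> toUtterances(String text) {
        List<List<String>> utterances = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            current.add(token);
            if (isBreak(token)) {
                utterances.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) utterances.add(current);  // trailing tokens
        return utterances;
    }

    public static void main(String[] args) {
        System.out.println(toUtterances("Hello there. How are you?"));
    }
}
```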
<H2 ALIGN=CENTER>Figure 1: The Utterance after Tokenization</H2>
<P><IMG SRC="images/img0.jpg" NAME="Graphic1" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H3>Utterance Processing</H3>
<P>A Voice maintains a list of UtteranceProcessors. Each Utterance
generated by the tokenization step is run through the
UtteranceProcessors for the Voice. Each processor receives as input
the Utterance that is being processed. The UtteranceProcessor may add
new Relations to the Utterance, add new Items to Relations, or add
new FeatureSets to Items or to the Utterance itself. Oftentimes, a
series of UtteranceProcessors is tightly coupled: one
UtteranceProcessor may add a Relation to an Utterance that is used
by a subsequent UtteranceProcessor.</P>
<P>CMUVoice sets up most of the UtteranceProcessors used by
CMUDiphoneVoice. CMUVoice provides a number of <FONT FACE="Courier, sans-serif">getXXX</FONT>
methods that return an UtteranceProcessor, such as <FONT FACE="Courier, sans-serif">getUnitSelector</FONT>
and <FONT FACE="Courier, sans-serif">getUnitConcatenator</FONT>.
Subclasses of CMUVoice override these <FONT FACE="Courier, sans-serif">getXXX</FONT>
methods to customize the processing. For instance, the
CMUDiphoneVoice overrides <FONT FACE="Courier, sans-serif">getUnitSelector</FONT>
to return a DiphoneUnitSelector.</P>
<H3>CMUDiphoneVoice Utterance Processing</H3>
<P>The UtteranceProcessors described in this section are invoked when
the CMUDiphoneVoice processes an Utterance. When processing begins,
the Utterance contains the token list and FeatureSets.</P>
<H4><a name="TokenToWords">TokenToWords</a></H4>
<P>The TokenToWords UtteranceProcessor creates a word Relation from
the token Relation by iterating through the token Relation's Item list
and creating one or more words for each token. For most tokens there
is a one-to-one relationship between words and tokens, in which case
a single word Item is generated for the token Item. Other tokens,
such as "2001", generate multiple words ("two thousand
one"). Each word is created as an Item and added to the word
Relation. Additionally, each word Item is added as a daughter to the
corresponding token in the token Relation.</P>
<P>The main role of TokenToWords is to look for various forms of
numbers and convert them into the corresponding English words.
TokenToWords looks for simple digit strings, comma-separated numerals
(such as 1,234,567), ordinal values, years, floating-point values,
and exponential notation. TokenToWords uses the JDK 1.4 regular
expression API to perform some classification. In addition, a CART
(Classification and Regression Tree) is used to classify numbers as
one of: year, ordinal, cardinal, or digits. Refer to <A HREF="#cart">Classification
and Regression Trees</A> for more information on CARTs.</P>
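<P>The regular-expression part of the classification can be sketched as follows.
The patterns below are illustrative only; the real TokenToWords also consults
the nums CART for the ambiguous year/ordinal/cardinal/digits decision:</P>

```java
import java.util.regex.Pattern;

// Sketch of token classification in the spirit of TokenToWords.
public class NumberClassSketch {
    static final Pattern COMMA_NUMBER = Pattern.compile("\\d{1,3}(,\\d{3})+");
    static final Pattern ORDINAL = Pattern.compile("\\d+(st|nd|rd|th)");
    static final Pattern YEARISH = Pattern.compile("(1[0-9]|20)\\d{2}");
    static final Pattern DIGITS = Pattern.compile("\\d+");

    static String classify(String token) {
        if (COMMA_NUMBER.matcher(token).matches()) return "comma-number";
        if (ORDINAL.matcher(token).matches()) return "ordinal";
        if (YEARISH.matcher(token).matches()) return "year";
        if (DIGITS.matcher(token).matches()) return "digits";
        return "word";
    }

    public static void main(String[] args) {
        for (String t : new String[] {"1,234,567", "3rd", "2001", "42", "hello"}) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```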
<H2 ALIGN=CENTER>Figure 2: The Utterance after TokenToWords</H2>
<P><IMG SRC="images/img1.jpg" NAME="Graphic2" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="PartOfSpeechTagger">PartOfSpeechTagger</a></H4>
<P>The PartOfSpeechTagger UtteranceProcessor is a place-holder
processor that currently does nothing.</P>
<H2 ALIGN=CENTER>Figure 3: The Utterance after PartOfSpeechTagger</H2>
<P><IMG SRC="images/img2.jpg" NAME="Graphic3" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><A NAME="Phraser"></A>Phraser</H4>
<P>The Phraser processor creates a phrase Relation in the Utterance.
The phrase Relation represents how the Utterance is to be broken into
phrases when spoken. The phrase Relation consists of an Item marking
the beginning of each phrase in the Utterance. Each phrase Item has
as its daughters the list of words that are part of that phrase.</P>
<P>The Phraser builds the phrase Relation by iterating through the
word Relation created by the TokenToWords processor. The Phraser uses
a Phrasing CART to determine where the phrase breaks occur and
creates the phrase Items accordingly.</P>
<H2 ALIGN=CENTER>Figure 4: The Utterance after Phraser Processing</H2>
<P><IMG SRC="images/img3.jpg" NAME="Graphic4" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="Segmenter">Segmenter</a></H4>
<P>The Segmenter is one of the more complex UtteranceProcessors. It
is responsible for determining where syllable breaks occur in the
Utterance. It organizes this information in several new Relations in
the Utterance.</P>
<P>The Segmenter iterates through each word in the Utterance. For
each word, the Segmenter performs the following steps:</P>
<UL>
<LI><P STYLE="margin-bottom: 0cm">Retrieves the phones that are
associated with the word from the <A HREF="#lexicon">Lexicon</A>.
Each word is organized in a Relation called "SylStructure".</P>
<LI><P STYLE="margin-bottom: 0cm">Iterates through each phone of the
word, adding the phone to a Relation called "Segment".</P>
<LI><P STYLE="margin-bottom: 0cm">Determines where syllable breaks
occur (with help from the Lexicon) and notes the syllable break
points in a Relation called "Syllable".</P>
<LI><P>If the Lexicon indicates that a particular phone is stressed,
then the syllable that contains that phone is marked as "stressed".</P>
</UL>
<P>When the Segmenter is finished, three new Relations have been
added to the Utterance that denote the syllable structure and units
of the Utterance.</P>
<H2 ALIGN=CENTER>Figure 5: The Utterance after Segmenter Processing</H2>
<P><IMG SRC="images/img4.jpg" NAME="Graphic5" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="PauseGenerator">PauseGenerator</a></H4>
<P>The PauseGenerator annotates an Utterance with pause information.
It inserts a pause at the beginning of the segment list (thus all
Utterances start with a pause). It then iterates through the phrase
Relation (set up by the <A HREF="#Phraser">Phraser</A>) and inserts a
pause before the first segment of each phrase.</P>
<H2 ALIGN=CENTER>Figure 6: The Utterance after PauseGenerator Processing</H2>
<P><IMG SRC="images/img5.jpg" NAME="Graphic6" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="Intonator">Intonator</a></H4>
<P>The Intonator processor annotates the syllable Relation of an
Utterance with "accent" and "endtone" features.
A typical application of this uses the ToBI (Tones and Break Indices)
scheme for transcribing intonation and accent in English, developed
by Janet Pierrehumbert and Mary Beckman.</P>
<P>The intonation processing itself is independent of the ToBI annotation:
ToBI annotations are not interpreted by this class, but are merely copied from
the CART result to the "accent" and "endtone"
features of the syllable Relation.</P>
<P>This processor relies on two <A HREF="#cart">CARTs</A>: an accent
CART and a tone CART. The processor iterates through each syllable
in the syllable Relation, applies each CART to the syllable, and sets
the accent and endtone features of the Item based upon the results of
the CART lookups.</P>
<H2 ALIGN=CENTER>Figure 7: The Utterance after Intonator Processing</H2>
<P><IMG SRC="images/img6.jpg" NAME="Graphic7" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="PostLexicalAnalyzer">PostLexicalAnalyzer</a></H4>
<P>The PostLexicalAnalyzer is responsible for performing any fix-ups
before the next phase of processing. For instance, the
CMUDiphoneVoice provides a PostLexicalAnalyzer that performs two
operations:</P>
<UL>
<LI><P STYLE="margin-bottom: 0cm"><B>Fix AH</B> - The diphone data for
the CMUDiphoneVoice does not have any diphone data for the "ah"
diphone. The CMU Lexicon that is used by the CMUDiphoneVoice,
however, contains a number of words that reference the "ah"
diphone. The CMUDiphoneVoice PostLexicalAnalyzer therefore iterates through
all phones in the segment Relation and replaces each "ah"
with "aa".</P>
<LI><P><B>Fix Apostrophe-S</B> - This step iterates through the
segments and looks for words associated with the segments that
contain an apostrophe-s. The processor then inserts a 'schwa'
phoneme in certain cases.</P>
</UL>
<H4><a name="Durator">Durator</a></H4>
<P>The Durator is responsible for determining the ending time for
each unit in the segment list. The Durator uses a CART to look up the
statistical average duration and standard deviation for each phone,
and calculates an exact duration based upon the CART-derived
adjustment. Each unit is finally tagged with an "end"
attribute that indicates the time, in seconds, at which the unit
ends.</P>
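<P>The duration calculation can be sketched as follows. The per-phone statistics
and z-score adjustments below are invented illustration values, not FreeTTS
data:</P>

```java
// Sketch of duration assignment: each segment gets an "end" time computed
// from a per-phone mean and standard deviation plus a CART-style adjustment.
public class DuratorSketch {
    static double[] assignEnds(double[] means, double[] stddevs, double[] zscores) {
        double[] ends = new double[means.length];
        double time = 0.0;
        for (int i = 0; i < means.length; i++) {
            double duration = means[i] + zscores[i] * stddevs[i];
            time += duration;
            ends[i] = time;       // "end" attribute: absolute time in seconds
        }
        return ends;
    }

    public static void main(String[] args) {
        double[] means = {0.10, 0.08, 0.12};
        double[] stddevs = {0.02, 0.01, 0.03};
        double[] zscores = {0.5, 0.0, -1.0};   // CART-derived adjustments
        for (double end : assignEnds(means, stddevs, zscores)) {
            System.out.printf("end=%.3f%n", end);
        }
    }
}
```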
<H2 ALIGN=CENTER>Figure 8: The Utterance after Durator Processing</H2>
<P><IMG SRC="images/img7.jpg" NAME="Graphic8" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="ContourGenerator">ContourGenerator</a></H4>
<P>The ContourGenerator is responsible for calculating the F0
(fundamental frequency) curve for an Utterance. The paper <A HREF="http://citeseer.nj.nec.com/20262.html">Generating
F0 contours from ToBI labels using linear regression</A> by Alan W.
Black and Andrew J. Hunt describes the techniques used.</P>
<P>The ContourGenerator creates the "target" Relation and
populates it with target points that mark the time and target
frequency for each segment. The ContourGenerator is driven by a
file of feature model terms. For example, CMUDiphoneVoice uses
com/sun/speech/freetts/en/us/f0_lr_terms.txt. Here is an excerpt:</P>
<PRE>Intercept 160.584961 169.183380 169.570374 null
p.p.accent 10.081770 4.923247 3.594771 H*
p.p.accent 3.358613 0.955474 0.432519 !H*
p.p.accent 4.144342 1.193597 0.235664 L+H*
p.accent 32.081028 16.603350 11.214208 H*
p.accent 18.090033 11.665814 9.619350 !H*
p.accent 23.255280 13.063298 9.084690 L+H*
accent 5.221081 34.517868 25.217588 H*
accent 10.159194 22.349655 13.759851 !H*
accent 3.645511 23.551548 17.635193 L+H*
n.accent -5.691933 -1.914945 4.944848 H*
n.accent 8.265606 5.249441 7.398383 !H*
n.accent 0.861427 -1.929947 1.683011 L+H*
n.n.accent -3.785701 -6.147251 -4.335797 H*</PRE>
<P>The first column gives the feature name. It is followed by the
starting point, the mid-point, and the ending point for the term (in
terms of relative frequency deltas). The final column gives the
ToBI label that the feature value must match.</P>
<P>The ContourGenerator iterates through each syllable in the
Utterance and applies the linear regression model as follows:</P>
<OL>
<LI><P STYLE="margin-bottom: 0cm">For each entry in the
feature/model/terms table, extract the named feature.</P>
<LI><P STYLE="margin-bottom: 0cm">Compare the feature value to the
ToBI label specified in the table.</P>
<LI><P STYLE="margin-bottom: 0cm">If the features match, use
the start, mid-point, and end values to update the curve.</P>
<LI><P>Add the new target point to the target Relation.</P>
</OL>
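<P>The linear-regression step can be sketched as follows. The coefficients are
taken from the excerpt above, and the feature extraction is reduced to a simple
map lookup for illustration; the real ContourGenerator walks the Utterance
structure to find each feature:</P>

```java
import java.util.Map;

// Sketch of the linear-regression F0 model: each term contributes its
// start/mid/end coefficients when the named feature equals the term's
// ToBI label ("null" rows, like the intercept, always apply).
public class ContourSketch {
    static final Object[][] TERMS = {
        // feature      start        mid         end         label
        { "Intercept",  160.584961, 169.183380, 169.570374, null },
        { "accent",       5.221081,  34.517868,  25.217588, "H*" },
        { "accent",      10.159194,  22.349655,  13.759851, "!H*" },
    };

    static double[] predictF0(Map<String, String> features) {
        double[] f0 = new double[3];  // start, mid, end of the syllable
        for (Object[] term : TERMS) {
            String feature = (String) term[0];
            String label = (String) term[4];
            boolean applies = label == null || label.equals(features.get(feature));
            if (applies) {
                for (int i = 0; i < 3; i++) {
                    f0[i] += (Double) term[i + 1];
                }
            }
        }
        return f0;
    }

    public static void main(String[] args) {
        double[] f0 = predictF0(Map.of("accent", "H*"));
        System.out.printf("start=%.1f mid=%.1f end=%.1f%n", f0[0], f0[1], f0[2]);
    }
}
```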
<H2 ALIGN=CENTER>Figure 9: The Utterance after ContourGenerator Processing</H2>
<P><IMG SRC="images/img8.jpg" NAME="Graphic9" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="UnitSelector">UnitSelector</a></H4>
<P>The UnitSelector used by the CMUDiphoneVoice creates a
Relation in the Utterance called "unit". This Relation
contains Items that represent the diphones for the Utterance. The
processor iterates through the segment list and builds up diphone
names by assembling two adjacent phone names. Each diphone is added to
the unit Relation along with timing information about the diphone.</P>
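<P>The diphone naming step can be sketched in a few lines; the phone sequence
and the "-" separator below are illustrative:</P>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of diphone unit naming: each adjacent phone pair becomes a diphone.
public class DiphoneNameSketch {
    static List<String> toDiphones(String[] phones) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i + 1 < phones.length; i++) {
            units.add(phones[i] + "-" + phones[i + 1]);
        }
        return units;
    }

    public static void main(String[] args) {
        // "pau" marks the silence at the utterance boundaries
        String[] phones = {"pau", "hh", "ah", "l", "ow", "pau"};
        System.out.println(toDiphones(phones));
    }
}
```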
<H2 ALIGN=CENTER>Figure 10: The Utterance after UnitSelector Processing</H2>
<P><IMG SRC="images/img9.jpg" NAME="Graphic10" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="PitchMarkGenerator">PitchMarkGenerator</a></H4>
<P>The PitchMarkGenerator is responsible for calculating pitchmarks
for the Utterance. The pitchmarks are generated by iterating through
the target Relation and calculating a slope based upon the desired
time and F0 values for each Item in the target Relation. The
resulting slope is used to calculate a series of target times for
each pitchmark. These target times are stored in an LPCResult object
that is added to the Utterance.</P>
<P><IMG SRC="images/img10.jpg" NAME="Graphic11" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<H4><a name="UnitConcatenator">UnitConcatenator</a></H4>
<P>The UnitConcatenator processor is responsible for gathering all of
the diphone data and joining it together. For each Item in the unit
Relation (recall that this is the set of diphones), the UnitConcatenator
extracts the unit sample data from the unit, based upon the target
times stored in the LPCResult.</P>
<H2 ALIGN=CENTER>Figure 11: The Utterance after UnitConcatenator Processing</H2>
<P STYLE="margin-bottom: 0cm"><IMG SRC="images/img11.jpg" NAME="Graphic11" ALIGN=BOTTOM WIDTH=800 HEIGHT=617 BORDER=0></P>
<TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
<TR><TD BGCOLOR="#eeeeff">
<H2><A NAME="data"></A>FreeTTS Data</H2>
</TD></TR>
</TABLE>
<P>FreeTTS uses a number of interesting data structures. These are
described in this section.</P>
<H3><A NAME="cart"></A>Classification and Regression Trees (CART)</H3>
<P>The use of Classification and Regression Trees is described in the
paper by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone,
<A HREF="http://citeseer.nj.nec.com/context/7119/0"><I>Classification
and Regression Trees</I></A>. Additional information about how such
trees can be used in the context of speech synthesis is given in
<A HREF="http://festvox.org/docs/speech_tools-1.2.0/c16616.htm">Chapter
10</A> of the System Documentation of the <I>Edinburgh Speech Tools
Library</I>. The Classification and Regression Trees (CARTs) in FreeTTS
are essentially binary decision trees used to classify some part of
an Utterance.</P>
<P>A CART is a tree of nodes and leaves. Each node consists of the
following elements:</P>
<UL>
<LI><P><B>Feature</B> - The feature to test. This is given in the form of a
feature traversal string. For instance, the feature string:</P>
<PRE STYLE="margin-bottom: 0.5cm">"R:SylStructure.daughter.R:Segment.p.end"</PRE>
<P STYLE="margin-bottom: 0cm">can be interpreted as:</P>
<P STYLE="margin-left: 2cm; margin-bottom: 0cm"><I>Given an Item in
the syllable Relation, find the SylStructure Relation in that
syllable, get the first daughter, find the Segment associated with
the daughter, find the previous Segment, and return its "end
time".</I></P>
<LI><P STYLE="margin-bottom: 0cm"><B>Operand</B> - The type of test to
perform. The available operands are:</P>
<UL>
<LI><P STYLE="margin-bottom: 0cm">LESS_THAN - the feature is less
than the value</P>
<LI><P STYLE="margin-bottom: 0cm">EQUAL - the feature is equal to
the value</P>
<LI><P STYLE="margin-bottom: 0cm">GREATER_THAN - the feature is
greater than the value</P>
<LI><P STYLE="margin-bottom: 0cm">MATCHES - the feature matches the
regular expression stored in the value</P>
</UL>
<LI><P STYLE="margin-bottom: 0cm"><B>Value</B> - The value against which
the feature is compared, based on the operand.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>Success Node</B> - If the
comparison is successful, tree traversal continues at this node.</P>
<LI><P STYLE="margin-bottom: 0cm"><B>Failure Node</B> - If the
comparison fails, traversal continues at this node.</P>
<LI><P><B>Type</B> - A node can be of two types, a NODE or a LEAF. A
NODE is a non-terminal member of the tree, whereas a LEAF is a
terminal node. Once the interpretation of a CART reaches a LEAF
node, the value of that node is returned.</P>
</UL>
<P>Typically, an UtteranceProcessor will employ a CART to
classify a particular Item or part of an Utterance. The CART
processing proceeds as follows:</P>
<OL>
<LI><P STYLE="margin-bottom: 0cm">Starting at the first node in the
CART, extract the feature pointed to by the node.</P>
<LI><P STYLE="margin-bottom: 0cm">Compare the feature, based upon the
node's operand, to the node's value. If the comparison succeeds, proceed
to the Success Node; otherwise, go to the Failure Node.</P>
<LI><P>Continue processing nodes in this fashion until a LEAF node
is reached, at which point return the value of that node.</P>
</OL>
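<P>The traversal described above can be sketched as follows; the node classes
and the feature lookup are toy stand-ins for the FreeTTS CART structures:</P>

```java
import java.util.Map;

// Sketch of CART interpretation: walk decision nodes, comparing an extracted
// feature against each node's value, until a leaf supplies the answer.
public class CartSketch {
    interface Node { }

    static class Leaf implements Node {
        final String value;
        Leaf(String value) { this.value = value; }
    }

    static class Decision implements Node {
        final String feature, operand, value;
        final Node success, failure;
        Decision(String feature, String operand, String value,
                 Node success, Node failure) {
            this.feature = feature; this.operand = operand; this.value = value;
            this.success = success; this.failure = failure;
        }
    }

    static String interpret(Node node, Map<String, String> features) {
        while (node instanceof Decision) {
            Decision d = (Decision) node;
            String actual = features.get(d.feature);
            boolean ok;
            switch (d.operand) {
                case "EQUAL":        ok = d.value.equals(actual); break;
                case "LESS_THAN":    ok = Double.parseDouble(actual)
                                          < Double.parseDouble(d.value); break;
                case "GREATER_THAN": ok = Double.parseDouble(actual)
                                          > Double.parseDouble(d.value); break;
                case "MATCHES":      ok = actual.matches(d.value); break;
                default: throw new IllegalArgumentException(d.operand);
            }
            node = ok ? d.success : d.failure;   // follow success/failure branch
        }
        return ((Leaf) node).value;
    }

    public static void main(String[] args) {
        // classify a numeric token: four digits parsing below 2100 -> year
        Node tree = new Decision("token", "MATCHES", "\\d{4}",
            new Decision("token", "LESS_THAN", "2100",
                new Leaf("year"), new Leaf("digits")),
            new Leaf("cardinal"));
        System.out.println(interpret(tree, Map.of("token", "2001")));
    }
}
```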
<TABLE WIDTH=100% BORDER=1 CELLPADDING=2 CELLSPACING=2>
<TR><TD COLSPAN=2><P><B>CARTs used by FreeTTS</B></P></TD></TR>
<TR><TD><P></P></TD>
<TD><P>used to determine where to place breaks in phrases.</P></TD></TR>
<TR><TD><P>en/us/int_accent_cart.txt</P></TD>
<TD><P>used to determine where to apply syllable accents.</P></TD></TR>
<TR><TD><P>en/us/int_tone_cart.txt</P></TD>
<TD><P>used to determine the type of 'end tone' for syllables.</P></TD></TR>
<TR><TD><P>en/us/durz_cart.txt</P></TD>
<TD><P>used to determine the duration for each segment of an
Utterance.</P></TD></TR>
<TR><TD><P>en/us/nums_cart.txt</P></TD>
<TD><P>used to classify numbers as cardinal, digits, ordinal or year.</P></TD></TR>
<TR><TD><P>ClusterUnitSelection<BR>en/us/cmu_awb/cmu_time_awb.txt<BR>130 CARTS, 2 nodes each</P></TD>
<TD><P>The cluster unit database contains separate CART trees,
each of which contains just a couple or so nodes. These CARTs are
used to select phoneme units.</P></TD></TR>
</TABLE>
838 <H3><a name="lexicon">Lexicon</a>
840 <P>The Lexicon provides a mapping of words to their pronunciations.
841 FreeTTS provides a generic lexicon interface
842 (<FONT FACE="Courier, sans-serif">com.sun.speech.freetts.lexicon)</FONT>
843 and a specific implementation, <FONT FACE="Courier, sans-serif">com.sun.speech.freetts.en.us.CMULexicon</FONT>
844 that provides a English language lexicon based upon CMU data. The
845 essential function of a Lexicon is to determine the pronunciation of
846 a word. The retrieval is done via the interface: Lexicon.getPhones.
849 <P>The Lexicon interface provides the ability to add new words to the
852 <P>The CMULexicon is an implementation of the Lexicon interface that
853 supports the Flite CMU Lexicon. The CMULexicon contains over 60,000
854 pronunciations. Here is a snippet:
856 <PRE>abbasi0 aa b aa1 s iy
858 abbatiello0 aa b aa t iy eh1 l ow
865 abbreviate0 ax b r iy1 v iy ey1 t
866 abbruzzese0 aa b r uw t s ey1 z iy
871 abdicating0 ae1 b d ih k ey1 t ih ng </PRE><P>
872 Each entry contains a word, with a part-of-speech tag appended to it,
873 followed by the phones representing the pronunciation. A separate
874 file maintains the addenda. The addenda is a smaller set of
875 pronunciations, typically used to provide custom, application-specific,
876 or domain-specific pronunciations.
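<P>The entry format described above can be parsed mechanically. The sketch
below is illustrative only: it assumes the part-of-speech tag is the
trailing digit of the first token, as in the snippet, and is not the actual
CMULexicon parser.</P>

```java
// Sketch of parsing one lexicon entry such as "abbasi0 aa b aa1 s iy"
// (assumes the part-of-speech tag is the trailing digit of the first token).
public class LexEntry {
    final String word, pos;
    final String[] phones;

    LexEntry(String line) {
        String[] tokens = line.trim().split("\\s+");
        String head = tokens[0];                       // e.g., "abbasi0"
        word = head.substring(0, head.length() - 1);   // "abbasi"
        pos = head.substring(head.length() - 1);       // "0"
        phones = new String[tokens.length - 1];        // the pronunciation
        System.arraycopy(tokens, 1, phones, 0, phones.length);
    }

    public static void main(String[] args) {
        LexEntry e = new LexEntry("abbasi0 aa b aa1 s iy");
        System.out.println(e.word + "/" + e.pos + " -> " + String.join(" ", e.phones));
    }
}
```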
878 <P>The CMULexicon implementation also relies on a set of
879 Letter-To-Sound rules. These rules can automatically determine the
880 pronunciation of a word. When the pronunciation of a word is
881 requested, the CMULexicon will first look it up in the main list of
882 words. If it is not found, the addenda is checked. If the word is
883 still not found, then the Letter-To-Sound rules are used to convert
884 the word into phones. To conserve space, the CMULexicon has been
885 stripped of all words that can be recreated using the Letter-To-Sound
886 rules. One can look at the 60,000 pronunciations in the Lexicon as
887 exceptions to the rule.
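<P>The lookup order just described (compiled word list, then addenda, then
Letter-To-Sound rules) can be sketched as follows. Plain Java maps stand in
for the real CMULexicon data structures, and the LTS stand-in is a
placeholder for the rules described in the next section.</P>

```java
// Sketch of the three-stage lookup order (not the actual CMULexicon code).
import java.util.HashMap;
import java.util.Map;

public class LookupOrder {
    static Map<String, String[]> mainList = new HashMap<>();
    static Map<String, String[]> addenda = new HashMap<>();

    static String[] getPhones(String wordAndPos) {
        String[] phones = mainList.get(wordAndPos);           // 1. compiled lexicon
        if (phones == null) phones = addenda.get(wordAndPos); // 2. addenda
        if (phones == null) phones = applyLtsRules(wordAndPos); // 3. LTS fallback
        return phones;
    }

    // Stand-in for the Letter-To-Sound rules described in the next section.
    static String[] applyLtsRules(String word) {
        return new String[] { "<lts:" + word + ">" };
    }

    public static void main(String[] args) {
        mainList.put("abbasi0", new String[] { "aa", "b", "aa1", "s", "iy" });
        System.out.println(String.join(" ", getPhones("abbasi0")));  // found in main list
        System.out.println(String.join(" ", getPhones("zyzzyva0"))); // falls through to LTS
    }
}
```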
889 <P>The Lexicon data is represented in two forms: text and binary. The
890 binary form loads much more quickly than the text form and is
891 the form that is generally used by FreeTTS. FreeTTS provides a method
892 of generating the binary form of the Lexicon from the text form of the data.
895 <H3>Letter-To-Sound Rules
897 <P>The Letter-To-Sound (LTS) rules are used to generate a phone
898 sequence for words not in the Lexicon. The LTS rules are a simple
899 state machine, with one entry point for each letter of the alphabet.
901 <P>The state machine consists of a large list of entries. There are
902 two types of entries: a STATE and a PHONE. A STATE entry contains a
903 decision and the indices of two other entries. The first of these two
904 indices represents where to go if the decision is true, and the
905 second represents where to go if the decision is false. A PHONE entry
906 is the final state of the decision tree and contains the phone that should be returned.
909 <P>The decision in FreeTTS's case is a simple character comparison,
910 but it is done in the context of a window around the character in the
911 word. The decision consists of an index into the context window and a
912 character value. If the character in the context window matches the
913 character value, then the decision is true. The machine traversal for
914 each letter starts at that letter's entry in the state machine and
915 ends only when it reaches a final state. If there is no phone that
916 can be mapped, the phone in the final state is set to 'epsilon.' The
917 context window for a character is generated in the following way:
920 <LI><P STYLE="margin-bottom: 0cm">Pad the original word on either
921 side with '#' and '0' characters to the size of the window for the
922 LTS rules (in FreeTTS's case, this is 4). The "#" is used
923 to indicate the beginning and end of the word. So, the word "monkey"
924 would turn into "000#monkey#000".
926 <LI><P>For each character in the word, the context window consists
927 of the characters in the padded form that precede and follow that
928 character. The number of characters on each side is dependent upon the
929 window size. So, for FreeTTS, the context window for the 'k' in
930 monkey is "#money#0".
933 <P>Here's how the phone for 'k' in 'monkey' might be determined:
936 <LI><P STYLE="margin-bottom: 0cm">Create the context window
937 "#money#0".
939 <LI><P STYLE="margin-bottom: 0cm">Start at the entry for 'k' in the
940 state machine.
942 <LI><P STYLE="margin-bottom: 0cm">Grab the 'index' from the current
943 state. This represents an index into the context window. Compare the
944 value of the character at the index in the context window to the
945 character from the current state. If there is a match, the next
946 state is the true value. If there is not a match, the next state is the false value.
949 <LI><P STYLE="margin-bottom: 0cm">Repeat the previous step until you reach a final (PHONE) state.
952 <LI><P>When you get to the final state, the phone is the value contained in that state.
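<P>The padding and context-window construction can be sketched as follows.
Window size 4 is used, as stated above; this is a simplified illustration,
not the actual FreeTTS Letter-To-Sound implementation.</P>

```java
// Sketch of padding and context-window construction for the LTS rules
// (window size 4, as described above; simplified, not the real classes).
public class LtsWindow {
    static final int WINDOW = 4;

    // Pad with '#' (word boundary) and '0' on each side, 4 characters total.
    static String pad(String word) {
        return "000#" + word + "#000";
    }

    // Context for the letter at position i of the original word:
    // the WINDOW characters before it and the WINDOW characters after it.
    static String contextWindow(String word, int i) {
        String p = pad(word);
        int c = i + WINDOW;   // position of the letter in the padded form
        return p.substring(c - WINDOW, c) + p.substring(c + 1, c + 1 + WINDOW);
    }

    public static void main(String[] args) {
        System.out.println(pad("monkey"));              // 000#monkey#000
        System.out.println(contextWindow("monkey", 3)); // window for 'k': #money#0
    }
}
```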
<H3>Unit Selection</H3>
958 <P>FreeTTS is written in such a way that the unit selection can be done
959 using several methods. The current methods are diphone unit selection
960 and cluster unit selection.
962 <P>Luckily, the unit selection is independent of the wave synthesis.
963 As a result, if the units from either unit selection type share the
964 same format, the same wave synthesis technique can be used. This is
965 the case for the KAL diphone and AWB cluster unit voices.
967 <H4>Diphone Unit Selection
969 <P>The diphone unit selection is very simple: it combines each pair of
970 adjacent phonemes into a unit name separated by a "-". These
971 names are used to look up entries in the diphone database.
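<P>The pairing step can be sketched as follows (the phone sequence in the
example is illustrative, not taken from an actual FreeTTS utterance).</P>

```java
// Sketch of the diphone pairing step: adjacent phones become diphone names.
import java.util.ArrayList;
import java.util.List;

public class DiphoneNames {
    static List<String> toDiphones(List<String> phones) {
        List<String> diphones = new ArrayList<>();
        for (int i = 0; i + 1 < phones.size(); i++) {
            // Each adjacent pair is joined with "-" to form a database key.
            diphones.add(phones.get(i) + "-" + phones.get(i + 1));
        }
        return diphones;
    }

    public static void main(String[] args) {
        // An illustrative phone sequence, bracketed by silence ("pau").
        System.out.println(toDiphones(List.of("pau", "hh", "ax", "l", "ow", "pau")));
    }
}
```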
973 <H4>Cluster Unit Selection
975 <P>The cluster unit selection is a bit more complex. Instead of
976 working with diphones, it works on one unit at a time, and there can
977 be more than one instance of a unit per database.
979 <P>The first step in cluster unit selection determines the unit type
980 for each unit in the Utterance. The unit type for selection in the
981 simple talking clock example (cmu_time_awb) is done per phone. The
982 unit type consists of the phone name followed by the word the phone
983 comes from (e.g., n_now for the phone 'n' in the word 'now').
985 <P>The unit database contains multiple instances of each unit type,
986 indexed by number (e.g., n_now_0, n_now_1, etc.). The database also
987 records which unit instances come before and
988 after each unit (e.g., n_now_13 is preceded by z_is_13 and is
989 followed by unit_aw_now_13).
991 <P>Once the unit types have been determined, the next step is to
992 select the best unit instance. This is done using a Viterbi algorithm
993 where the cost is based upon the Mel-cepstral distance between
994 candidates. The candidate selection is determined using two things:
997 <LI><P STYLE="margin-bottom: 0cm">A CART - given an item, the CART
998 will return a list of the unit type instances that are potential
999 choices. Most of the CARTs in cmu_time_awb are very simple - there
1000 are no choices and the first node is a leaf node containing the list of candidates.
1003 <LI><P>Extended selections. For each candidate for the preceding unit,
1004 the candidate selection examines the unit that follows it in the
1005 database. If that unit's type is the same as the current unit's type,
1006 it is added as a candidate.
1009 <P>After the candidates are chosen, the Viterbi algorithm is used to
1010 calculate path costs. The basic algorithm is as follows:
1013 <LI><P STYLE="margin-bottom: 0cm">For each candidate for the current
1014 unit, calculate the cost between it and the first candidate in the
1015 next unit. Save only the path that has the least cost. By default,
1016 if two candidates come from units that are adjacent in the database,
1017 the cost is 0 (i.e., they were spoken together, so they are a natural fit).
1020 <LI><P STYLE="margin-bottom: 0cm">Repeat the previous process for
1021 each candidate in the next unit, creating a list of least-cost paths
1022 between the candidates of the current unit and those of the next unit.
1025 <LI><P STYLE="margin-bottom: 0cm">Toss out all candidates in the
1026 current unit that are not included in a path.
1028 <LI><P>Move to the next unit and repeat the process.
1031 <P>Once the whole utterance has been processed, the path with the least
1032 cost is identified; it determines the RELP-encoded samples to
1033 choose from the database.
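<P>The least-cost path search can be sketched as follows. The cost function
here is a toy stand-in: the real implementation uses Mel-cepstral distances
and the unit database's adjacency information, neither of which is modeled
here.</P>

```java
// Simplified sketch of the least-cost path search over unit candidates
// (abstract integer costs; not the actual FreeTTS Viterbi implementation).
import java.util.List;

public class ViterbiSketch {
    // Toy join cost: 0 if the candidates are adjacent in the database
    // (consecutive indices), otherwise the distance between their indices.
    static int cost(int a, int b) {
        return (b == a + 1) ? 0 : Math.abs(a - b);
    }

    // candidates.get(u) holds the candidate database indices for unit u.
    // Returns the minimum total join cost over all paths through the units.
    static int leastCost(List<List<Integer>> candidates) {
        int[] best = new int[candidates.get(0).size()]; // cost ending at each candidate
        for (int u = 1; u < candidates.size(); u++) {
            List<Integer> prev = candidates.get(u - 1);
            List<Integer> cur = candidates.get(u);
            int[] next = new int[cur.size()];
            for (int j = 0; j < cur.size(); j++) {
                int min = Integer.MAX_VALUE;
                for (int i = 0; i < prev.size(); i++) {
                    min = Math.min(min, best[i] + cost(prev.get(i), cur.get(j)));
                }
                next[j] = min;   // keep only the least-cost path to this candidate
            }
            best = next;         // candidates not on a least-cost path are dropped
        }
        int min = Integer.MAX_VALUE;
        for (int c : best) min = Math.min(min, c);
        return min;
    }

    public static void main(String[] args) {
        // Three units; candidates 13 -> 14 -> 15 are adjacent in the
        // database, so the best path has zero total cost.
        List<List<Integer>> cands =
                List.of(List.of(13, 40), List.of(14, 7), List.of(15));
        System.out.println(leastCost(cands));  // 0
    }
}
```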
1035 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
1037 <TD BGCOLOR="#eeeeff">
1038 <H2><A NAME="code"></A>CMUDiphoneVoice Code Walkthrough
1043 <P>In this section, we will look at the CMUDiphoneVoice class to see
1044 how a new voice is created and customized.
1048 * Defines an unlimited-domain, diphone-synthesis-based voice
1050 <FONT COLOR="#008000">The CMUDiphoneVoice class extends CMUVoice.
1051 CMUVoice provides much of the standard voice definition, including
1052 loading the Lexicon and setting up the common features and
1053 UtteranceProcessors. </FONT>
1055 <PRE>public class CMUDiphoneVoice extends CMUVoice {
1058 * Creates a simple voice
1061 <FONT COLOR="#008000">It is possible and quite likely that multiple
1062 voices will want to share a single Lexicon. By passing false to the
1063 CMUVoice constructor, this voice indicates that by default no Lexicon
1064 should be created. This allows a voice manager (such as the
1065 FreeTTSSynthesizer) to create a single Lexicon and have multiple voices share it. </FONT>
1068 <PRE> public CMUDiphoneVoice() {
1073 * Creates a simple voice
1075 * @param createLexicon if true automatically load up
1076 * the default CMU lexicon; otherwise, don't load it.
1078 <FONT COLOR="#008000">This version of the constructor sets the
1079 standard rate, pitch and range values for the voice. </FONT>
1081 <PRE> public CMUDiphoneVoice(boolean createLexicon) {
1082 super(createLexicon);
1089 * Sets the FeatureSet for this Voice.
1091 * @throws IOException if an I/O error occurs
1093 If this voice needed to add or customize the feature set, it would do
1094 so here. This voice is happy with the default features provided by
1095 CMUVoice, so nothing is done here.
1097 <PRE> protected void setupFeatureSet() throws IOException {
1098 super.setupFeatureSet();
1102 * Returns the post lexical processor to be used by this voice.
1103 * Derived voices typically override this to customize behaviors.
1105 * @return the post lexical processor
1107 * @throws IOException if an IO error occurs while getting
1110 <FONT COLOR="#008000">Here is an example of how to override the
1111 default Utterance processing provided by CMUVoice. CMUDiphoneVoice
1112 needs to provide a post-lexical analyzer that converts one phone "ah"
1113 to another "aa". CMUVoice provides a number of 'getXXXX'
1114 functions that return the UtteranceProcessor that will be used for
1115 that stage of processing. CMUDiphoneVoice overrides the
1116 getPostLexicalAnalyzer method to provide the customized post lexical
1119 <PRE> protected UtteranceProcessor getPostLexicalAnalyzer() throws IOException {
1120 return new CMUDiphoneVoicePostLexicalAnalyzer();
1124 * Returns the pitch mark generator to be used by this voice.
1125 * Derived voices typically override this to customize behaviors.
1126 * This voice uses a DiphonePitchMark generator to generate
1129 * @return the pitch mark generator
1131 * @throws IOException if an IO error occurs while getting
1135 <FONT COLOR="#008000">The diphone voice needs to provide a customized
1136 pitchmark generator that is specific to diphone synthesis.</FONT>
1138 <PRE> protected UtteranceProcessor getPitchmarkGenerator() throws IOException {
1139 return new DiphonePitchmarkGenerator();
1143 * Returns the unit concatenator to be used by this voice.
1144 * Derived voices typically override this to customize behaviors.
1145 * This voice uses a relp.UnitConcatenator to concatenate units.
1147 * @return the unit concatenator
1149 * @throws IOException if an IO error occurs while getting
1153 <FONT COLOR="#008000">This voice uses the standard UnitConcatenator. </FONT>
1155 <PRE> protected UtteranceProcessor getUnitConcatenator() throws IOException {
1156 return new UnitConcatenator();
1161 * Returns the unit selector to be used by this voice.
1162 * Derived voices typically override this to customize behaviors.
1163 * This voice uses the DiphoneUnitSelector to select units. The
1164 * unit selector requires the name of a diphone database. If no
1165 * diphone database has been specified (by setting the
1166 * DATABASE_NAME feature of this voice) then by default
1167 * cmu_kal/diphone_units.bin is used.
1169 * @return the unit selector
1171 * @throws IOException if an IO error occurs while getting
1174 <FONT COLOR="#008000">The unit selector is also diphone specific.
1175 Note that this method also specifies which diphone unit database to
1176 use if none has been supplied already. </FONT>
1178 <PRE> protected UtteranceProcessor getUnitSelector() throws IOException {
1179 String unitDatabaseName = getFeatures().getString(DATABASE_NAME);
1181 if (unitDatabaseName == null) {
1182 unitDatabaseName = "cmu_kal/diphone_units.bin";
1185 return new DiphoneUnitSelector(
1186 this.getClass().getResource(unitDatabaseName));
1191 * Converts this object to a string
1193 * @return a string representation of this object
1195 public String toString() {
1196 return "CMUDiphoneVoice";
1202 * Annotates the Utterance with post lexical information. Converts AH
1203 * phonemes to AA phonemes in addition to the standard English postlex processing.
1207 <FONT COLOR="#008000">Here is an example of defining a new
1208 UtteranceProcessor. This UtteranceProcessor traverses through the
1209 SEGMENT Relation looking for all phones of type "ah" and
1210 converts them to "aa" phones. Since this Processor is used
1211 to replace the default post-lexical analyzer processor, it invokes
1212 the default post-lexical analyzer after performing the custom conversion. </FONT>
1215 <PRE>class CMUDiphoneVoicePostLexicalAnalyzer implements UtteranceProcessor {
1216 UtteranceProcessor englishPostLex =
1217 new com.sun.speech.freetts.en.PostLexicalAnalyzer();
1220 * performs the processing
1221 * @param utterance the utterance to process/tokenize
1222 * @throws ProcessException if an IOException is thrown during the
1223 * processing of the utterance
1225 public void processUtterance(Utterance utterance) throws ProcessException {
1226 fixPhoneme_AH(utterance);
1227 englishPostLex.processUtterance(utterance);
1232 * Turns all AH phonemes into AA phonemes.
1233 * This should really be done in the index itself
1234 * @param utterance the utterance to fix
1236 private void fixPhoneme_AH(Utterance utterance) {
1237 for (Item item = utterance.getRelation(Relation.SEGMENT).getHead();
1239 item = item.getNext()) {
1240 if (item.getFeatures().getString("name").equals("ah")) {
1241 item.getFeatures().setString("name", "aa");
1246 // inherited from Object
1247 public String toString() {
1248 return "PostLexicalAnalyzer";
1251 <TABLE WIDTH=100% CELLPADDING=2 CELLSPACING=2>
1253 <TD BGCOLOR="#eeeeff">
1254 <H2><A NAME="packaging"></A>Voice Packaging
1259 <P> FreeTTS has been designed to allow flexible and dynamic addition of voices.
1261 <H3>Voice Packages</H3>
1263 Voices are defined by their corresponding VoiceDirectories. These directories
1264 are what actually create the instances of the voices, and a single directory
1265 can create several different voices. This is useful when one voice class can
1266 sound dramatically different depending on the parameters it is created with:
1267 the directory can then return more than one instance of the same voice class,
1268 each sounding different. It is also useful when a single voice package
1269 contains more than one voice, since it provides a single interface to those voices.
1270 The voice directory MUST also provide a main() function that prints
1271 out information about the voices when invoked. Typically this is done by
1272 simply calling the VoiceDirectory's toString() method.
1274 <P> A voice package is a jar file that contains exactly one subclass of
1275 VoiceDirectory. The package typically also contains data files and other
1276 Java classes that implement the voices provided. The jarfile Manifest
1277 must also include three entries:</P>
1279 <LI><P>"Main-Class:", which names the VoiceDirectory subclass and,
1280 when run, prints out information about the voices provided</P>
1281 <LI><P>"FreeTTSVoiceDefinition: true", which informs
1282 FreeTTS that this jarfile is a voice package</P>
1283 <LI><P>"Class-Path:", which lists all the jars upon which
1284 this voice package depends. For example, the voice may
1285 be dependent upon its lexicon jarfile. This allows a user
1286 to simply execute the main() function without having to specify
1287 all of the dependencies (which the user may not know).</P>
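<P>Putting the three entries together, a voice package Manifest might look
like the following (the class and jar names here are hypothetical, chosen
only to illustrate the format):</P>

```
Main-Class: com.example.myvoice.MyVoiceDirectory
FreeTTSVoiceDefinition: true
Class-Path: mylexicon.jar
```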
1289 <H3>Installing Voice Packages</H3>
1290 <P>Voice Packages can be added to FreeTTS without any compilation. There
1291 are two ways to alert FreeTTS to the presence of a new voice:</P>
1293 <LI><P>Listing of the VoiceDirectory classes that are loaded.</P>
1294 <LI><P>Putting the packages in the correct directory and allowing
1295 FreeTTS to automatically detect them.
1296 [[[TODO: This is not yet implemented. For now use the listing method.]]]
1299 <P>Listing the VoiceDirectory classes requires that all of the required classes
1300 be in the Java classpath. The names of the voice directory
1301 classes are listed in voices files. When VoiceManager.getVoices() is
1302 called, it reads several files:</P>
1304 <LI><P>First, it looks for internal_voices.txt, stored
1305 in the same directory as VoiceManager.class (if the VoiceManager is in a
1306 jarfile, which it probably is, then this file is also inside the jar file).
1307 If the file does not exist, FreeTTS moves on. internal_voices.txt only
1308 exists to allow one to package FreeTTS into a single stand-alone jarfile,
1309 as may be needed for applets. Avoid using internal_voices.txt if
1310 at all possible: it requires you to ship all listed voices
1311 along with FreeTTS and provides minimal flexibility.</P>
1313 <LI><P>Next, FreeTTS looks for voices.txt in the same directory as
1314 freetts.jar (assuming FreeTTS is being executed from a jar, which
1315 it probably is). If the file does not exist, FreeTTS moves on.</P>
1317 <LI><P>Last, if the system property "freetts.voicesfile"
1318 is defined, then FreeTTS will use the voice directory classes
1319 listed in that file.</P>
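<P>A voices file is simply a list of fully qualified VoiceDirectory class
names, one per line. For example (the class name below is hypothetical):</P>

```
com.example.myvoice.MyVoiceDirectory
```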
1322 <P>Voice packages can also be recognized simply by putting them in
1323 the correct filesystem directory.
1324 [[[TODO: At least, that is the plan. This is not yet actually implemented.]]]
1325 If a jarfile is in the correct directory
1326 and has the "FreeTTSVoiceDefinition: true" definition
1327 in its Manifest, then it is assumed to be a voice package. The file
1328 is then loaded along with all dependencies listed in the
1329 "Class-Path:" definition. Whatever class is listed as the
1330 "Main-Class:" is assumed to be the voice directory. There
1331 are two ways to specify which filesystem directory to look in:</P>
1333 <LI><P>By default, FreeTTS will look in the same directory as
1334 freetts.jar (assuming FreeTTS was loaded from a jarfile, which
1335 it probably was).</P>
1336 <LI><P>The directories specified by the system property
1337 "freetts.voicespath".</P>
1340 <H3>Compiling Voice Packages</H3>
1341 <P>To create a voice package you simply need to meet the qualifications
1342 above. However, that can be a bit of work. If you want to import a voice
1343 from FestVox, there are tools in tools/FestVoxToFreeTTS/. View the
1344 README file there for more information. The scripts can automatically
1345 import a US/English voice, but are not designed to handle others. For the
1346 simple case of US/English voices, simply put them in a subdirectory of
1347 com/sun/speech/freetts/en/us/. Files ending with ".txt" will
1348 be assumed to be data files for the voice and compiled into their
1349 ".bin" and ".idx" equivalents. The file
1350 "voice.Manifest" will automatically be added to the Manifest
1351 of the voice package's jarfile. The compilation system will automatically
1352 detect new directories inside en/us, assume they are voice packages,
1353 and create new jarfiles for them.</P>
1356 <P STYLE="margin-bottom: 0cm">See the <A HREF="../license.terms">license
1357 terms</A> and <A HREF="../acknowledgments.txt">acknowledgments</A>.<BR>Copyright
1358 2003 Sun Microsystems, Inc. All Rights Reserved. Use is subject to