In this practical exercise you will be designing, building and evaluating an application to demonstrate Speech Synthesis. The best demos produced may be invited to feature in the Edinburgh International Science Festival. The Science Festival is aimed at the general public, and particularly at children. This extract from their web site gives an idea of the context in which the demo will be used, the level at which to pitch your demo, and the impact required from it.
Our aims are to give the children of Edinburgh and Scotland experiences of science that are inspiring and confidence building and to engage all of society in the wonder and value of science. Our passion is creating "Ah-ha" moments that illuminate the mysteries of our world. All our people strive hard to offer science and technology that children and grown-ups love to do.
To be effective, the demo should be engaging and entertaining as well as interesting and informative. Most importantly, the demo has to be easy to use --- as near to a walk-up-and-use interface as possible. Below we provide more details on the required and optional capabilities that the demo should or could include.
You should complete the exercise in pairs; this will allow you to divide the work up and discuss design decisions. Please find a partner for the practical and contact us immediately if you cannot.
The demo must be implemented in Java, for example using JFC/Swing. We have provided a simple example interface which you may use to get started.
There are two parts to the practical: (1) design and implementation and (2) user evaluation. You should submit your working system and a report explaining the design process and decisions at the end of week 6 (Friday 31st October) and a second report describing the results of the user evaluation at the end of week 9 (Friday 21st November). Further details about the expected contents of these reports and the submission process are given below. Please read through the whole document before starting work.
For a general background on speech synthesis, have a look at the Wikipedia entry for Speech Synthesis. The speech synthesis system we will use in this practical is provided by an Edinburgh-based speech synthesis company, CereProc. It is a concatenative, unit selection system (see the wiki article for an explanation of these terms) capable of producing very natural-sounding speech, as you can hear in these example recordings of CereVoice.
Please look at the paper The CereVoice Characterful Speech Synthesiser SDK for more details.
cp -r /group/teaching/hci/07_level10_practical/hci_practical . (don't forget the final dot!). Then:
To set the paths correctly, open "paths.xml" and edit the line referring to the files_io directory:
by replacing $USER with your username.
For these steps your working directory should be the hci_practical/ you copied above.
These scripts invoke javac and java with the correct path settings. You may need to adjust these scripts later to include additional Java source files.
While the GUI starts you will see some output in the terminal window, which should include:
-----------------------------------------------
Loading Cerevoice libraries...
** Loaded cerevoice_aud library
** Loaded cerevoice library
** Loaded expat library
** Loaded cerevoice_io library
-----------------------------------------------
***************Eliza read script**********************
Note that the voice won't be ready to speak until you see:
INFO: finished loading voice, starting server on port 1314
Put your headphones on and check the volume levels using the desktop volume control.
If the terminal output includes:
There was a problem reading the script file
check that you edited the file paths.xml correctly to point to your hci_practical/files_io/ directory (see instructions above).
If the terminal output includes:
Could not load Cerevoice wrappers
it is likely that some required components are not accessible. The code will only run on DICE machines which have access to paths inside /group/teaching/hci/.
The provided interface has a text field to type in, a text area for a history of user input and chatbot output, and some buttons. Notice that one of the buttons speaks if you hover the mouse over it, and again when you move the mouse off it.
The basic operation is to enter text and have it read back by the voice by pressing the corresponding button.
The next level of interaction is to tune the synthesis output to provide different tones of voice, pauses, and variant renderings. At present this can only be done by inserting tags into the text: the basic task for this practical is for you to implement a more user-friendly way of allowing this (e.g. via buttons). To get an idea of what different tags do, try some of the following:
You can put tags round text (from single words to complete utterances) that you want to be read with a certain emotion.
Calm voice: <usel genre='calm'>OK</usel>
Stressed voice: <usel genre='stressed'>OK</usel>
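As a minimal sketch of how a button press could produce the emotion tags above without the user typing any markup, consider the helper below. The class and method names (GenreTagger, wrapWithGenre) are hypothetical and are not part of the provided code; only the usel tag and the 'calm' and 'stressed' genres come from the examples above. The sketch does not escape special characters in the text.

```java
public class GenreTagger {

    /**
     * Wraps the given text in a usel tag with the given genre,
     * for example "calm" or "stressed" (the two genres shown above).
     */
    public static String wrapWithGenre(String text, String genre) {
        return "<usel genre='" + genre + "'>" + text + "</usel>";
    }

    public static void main(String[] args) {
        // prints <usel genre='calm'>OK</usel>
        System.out.println(wrapWithGenre("OK", "calm"));
    }
}
```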
Inserting punctuation such as commas, parentheses, full-stops will get an associated type and length of pause inserted automatically. The system will also insert pauses in long sentences with no punctuation at likely locations. To get greater control over the location, type and length of pauses you can insert break tags between words; these affect both the prosody of the speech and the length of the pause.
Short Break: <break type='2' time='0.1'/>
Phrase Break: <break type='3' time='0.3'/>
Sentence Break: <break type='4' time='0.6'/>
You can set a longer or shorter value; those above are the defaults for each break type. A value of 0.0 is sometimes useful when you want to tweak the prosody without actually inserting a pause.
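The break tags can be generated in the same spirit. The sketch below is only an illustration: the class and method names (BreakTagger, defaultTime, breakTag) are hypothetical, and it handles only the three break types and default times listed above, since no other types are described in these instructions.

```java
public class BreakTagger {

    /** Default pause length in seconds for break types 2, 3 and 4, as listed above. */
    public static double defaultTime(int type) {
        switch (type) {
            case 2: return 0.1; // short break
            case 3: return 0.3; // phrase break
            case 4: return 0.6; // sentence break
            default:
                throw new IllegalArgumentException("Unsupported break type: " + type);
        }
    }

    /** Builds a break tag with an explicit pause length in seconds (0.0 is allowed). */
    public static String breakTag(int type, double seconds) {
        if (seconds < 0.0) {
            throw new IllegalArgumentException("Pause length must not be negative");
        }
        return "<break type='" + type + "' time='" + seconds + "'/>";
    }

    /** Builds a break tag using the default pause length for the given type. */
    public static String breakTag(int type) {
        return breakTag(type, defaultTime(type));
    }

    public static void main(String[] args) {
        // prints <break type='3' time='0.3'/>
        System.out.println(breakTag(3));
        // prints <break type='2' time='0.0'/>
        System.out.println(breakTag(2, 0.0));
    }
}
```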
You can put 'variant' tags round a word to get a different rendering. This is useful for getting just the right choice for a particular context. You can put a tag round groups of words too, but generally enclosing a single word is more effective.
You can enter higher numbers, to hear further variants. If you preferred the first rendering, and don't want to clear the tags, change the variant number to "0".
Keep in mind that these different tags have different characteristics (e.g. categorical, continuous, numerical) so you might want to choose different interaction methods for them in your interface.
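One possible way to reflect these differences is to back each kind of tag with a different standard Swing model. The sketch below is an assumption about how this could be organised, not a required design: the class name TagControlModels, the upper limits (2.0 seconds, 20 variants) and the pairing of each model with a particular widget are all illustrative choices.

```java
import javax.swing.DefaultBoundedRangeModel;
import javax.swing.DefaultComboBoxModel;
import javax.swing.SpinnerNumberModel;

public class TagControlModels {

    /** Categorical: the emotion genre is one of a fixed set (could back a JComboBox). */
    public static DefaultComboBoxModel<String> genreModel() {
        return new DefaultComboBoxModel<String>(new String[] {"calm", "stressed"});
    }

    /**
     * Continuous: pause length stored in tenths of a second (could back a JSlider).
     * Initial value 3 (0.3 s), range 0 to 20; the upper limit is an assumption.
     */
    public static DefaultBoundedRangeModel pauseModel() {
        return new DefaultBoundedRangeModel(3, 0, 0, 20);
    }

    /** Numerical: the variant number, stepped by 1 from 0 (could back a JSpinner). */
    public static SpinnerNumberModel variantModel() {
        return new SpinnerNumberModel(0, 0, 20, 1);
    }

    /** Converts the slider position (tenths of a second) to seconds. */
    public static double pauseSeconds(DefaultBoundedRangeModel model) {
        return model.getValue() / 10.0;
    }

    public static void main(String[] args) {
        System.out.println(pauseSeconds(pauseModel())); // prints 0.3
    }
}
```

These models can be passed to the JComboBox, JSlider and JSpinner constructors respectively, which keeps the tag values separate from the widgets that display them.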
One of the buttons sends a file to be synthesised. Files sent for synthesis should be in XML format as in the provided example in files_io/in/info/wikipedia_speech_synthesis.xml.
The speaking button suggests one way that speech can be incorporated into the interface, but also illustrates how such features can be annoying or confusing if set off unintentionally by the user.
Finally, the sample interface provides a 'fun' extension in which the speech produced is not what you typed in, but the reply produced by the Eliza program. Type "Hello" and then press the corresponding button for a conversation.
The Java files for the GUI are in the package/directory uk/ac/ed/inf/hci_synth_gui. The main class in SpeechSynthesisDemo.java starts the demo and loads the voice. To see how the GUI itself works, look at UserInterface.java. You can either use this file as the basis for making the changes for your interface, or start from scratch.
InputPacker.java puts input into the correct format to send for text normalisation. The following files which deal with normalisation, homograph disambiguation, and speech synthesis, should not be edited: NormFunctions.java, SpeechClient.java, SpeechRequestHandler.java, SynthFunctions.java, SpeechServer.java.
The chatbot is based on Joseph Weizenbaum's Eliza program. Look at the script in files_io/in/script and the chatbot code in net/chayden/eliza/ to see how the conversation works. (This was adapted from code available online: overview and instructions on how to modify the script).
For more details on how the speech synthesis system we are using works, please look at the paper The CereVoice Characterful Speech Synthesiser SDK. It includes an overview of the system and has some examples using tags. Note that the Java-wrapped version of CereVoice which we are using in this practical does not yet include the <voice emotion='happy'> or sig (amplitude, f0 and rate) tags mentioned in that paper. However, you can use all the tags we have shown in the instructions above, as well as lex tags to set pronunciations. Experimenting with pronunciations is not required, but it is useful if you want to synthesise unusual words or override the pronunciation that the lexicon returns. If you would like to try it, have a look at the file hci_practical/scottish_phones.txt for more information, including the characters you can use to set pronunciations with the Scottish voice we are using in this practical.
In this part you will design and build the demo system. The requirements are to:
Provide a friendly interface that allows a naive user to enter text, hear it read back, and then to modify the emotion, pauses and rendering and hear the different results (obviously, this should not require them to type in the tags).
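One way to meet this requirement is to let the user select part of the text and press a button, with the program inserting the markup for them. The sketch below shows only the string handling, under stated assumptions: the class and method names (SelectionTagger, wrapRange, insertAt) are hypothetical, and the offsets are assumed to come from the text component, for example via getSelectionStart() and getSelectionEnd() on a Swing JTextField.

```java
public class SelectionTagger {

    /** Wraps the characters in [start, end) with the given opening and closing tags. */
    public static String wrapRange(String text, int start, int end,
                                   String openTag, String closeTag) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException("Invalid selection: " + start + "-" + end);
        }
        return text.substring(0, start) + openTag
                + text.substring(start, end) + closeTag
                + text.substring(end);
    }

    /** Inserts a self-closing tag (such as a break tag) at the given offset. */
    public static String insertAt(String text, int offset, String tag) {
        return wrapRange(text, offset, offset, tag, "");
    }

    public static void main(String[] args) {
        String text = "Are you OK today";
        // prints Are you <usel genre='calm'>OK</usel> today
        System.out.println(wrapRange(text, 8, 10, "<usel genre='calm'>", "</usel>"));
    }
}
```

Keeping the tagged string separate from what is displayed means the user never has to see the tags, although how you present the marked-up regions is a design decision for you to make.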
This is the minimum you must do for a pass mark on Part 1.
The interface extension is open-ended. You should plan the design process carefully: you may try to involve a target user and use 'lo-fi' (e.g., pencil-and-paper) prototypes for testing. See remarks below on time management and using an IDE.
You should submit the relevant code and files to allow us to compile and run the demo, accompanied by a report of no more than 2000 words describing the design process, justifying design decisions (e.g. explaining features that you considered but decided not to implement), and assessing the strengths and weaknesses of what you have produced. To submit, one of the pair should use the electronic submission command like this:
submit cs4 hci-4 1 hci-practical.zip report1.pdf
The zip file must contain shell scripts compile.sh and run.sh so we can compile and run your interface. The report must be in PDF format and on the first page should give the name and matriculation number of both members of the pair. Please stick to the suggested file names for our convenience.
We will normally allocate both people the same mark. To notify us of an alternative allocation you have agreed, please put a table on the first page of the report with a percentage split. However, please only do this in extraordinary cases, and be aware that a clever design idea in this practical may have more impact on the mark than many hours of coding, so allocation by hours spent may not be the most appropriate.
The deadline for Part 1 is Friday 31st October, 4pm.
A think-aloud or cooperative evaluation (see Dix et al 9.4.3) with at least one naive user who, to the extent possible, should closely resemble a member of the target audience (i.e. a member of the general public who might attend a science festival). In this you should gather detailed feedback on how the system is used and any problems that arise. You should give the user a set of structured tasks and prompt them so as to ensure they explore the whole system.
This is the minimum you must do for a pass mark on Part 2.
Design a set of objective measures and closed questions and apply them to quantify the results for at least 5 subjects who freely interact with the system (i.e. as might happen if they encountered it at a science festival, so possibly only for a few minutes or not using all features). For example, how long do subjects play with the system? How many of the features do they discover? How many errors do they make? How do they rate it for fun? Have they learned anything about speech synthesis? Ideally these subjects should at least be naive to the aims of the practical (i.e. not fellow students on the HCI course). You should aim to make this evaluation fairly short and easy for the subjects to complete (e.g. a questionnaire that they might actually be willing to fill in if they were at a science festival).
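Some of the objective measures mentioned above (time spent, features discovered, errors made) could be recorded by the demo itself. The following is a rough sketch under stated assumptions rather than a prescribed method: the class SessionLog, its method names, the feature names and the CSV layout are all hypothetical, and timestamps are passed in explicitly (for example from System.currentTimeMillis()) so the logic is easy to check.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SessionLog {
    private final long startMillis;
    // A set, so that using the same feature twice counts as one discovery.
    private final Set<String> featuresUsed = new LinkedHashSet<String>();
    private int errorCount = 0;

    public SessionLog(long startMillis) {
        this.startMillis = startMillis;
    }

    /** Records that the subject used a feature, e.g. "emotion" or "pause". */
    public void featureUsed(String name) {
        featuresUsed.add(name);
    }

    /** Records one user error, however you decide to define an error. */
    public void errorOccurred() {
        errorCount++;
    }

    public int featuresDiscovered() {
        return featuresUsed.size();
    }

    public long durationSeconds(long nowMillis) {
        return (nowMillis - startMillis) / 1000;
    }

    /** One row per subject: id, duration in seconds, features discovered, errors. */
    public String toCsvRow(String subjectId, long nowMillis) {
        return subjectId + "," + durationSeconds(nowMillis) + ","
                + featuresDiscovered() + "," + errorCount;
    }

    public static void main(String[] args) {
        SessionLog log = new SessionLog(0L);
        log.featureUsed("emotion");
        log.featureUsed("emotion");
        log.featureUsed("pause");
        log.errorOccurred();
        System.out.println(log.toCsvRow("S1", 95000L)); // prints S1,95,2,1
    }
}
```

Subjective ratings, such as how fun the demo was, would still need to be collected separately, for instance with a short questionnaire.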
One option you might like to consider is collaborating with another group to either evaluate each other's system, or to use the same measurement protocol to make a comparison between your two systems. You will still need to submit one report per pair.
You should submit a report of no more than 2000 words explaining the evaluation methods chosen, reporting the results, and discussing them in relation to your previous assessment of the system. To submit, one of the pair should use the electronic submission command like this:
submit cs4 hci-4 2 report2.pdf
The report must be in PDF format and on the first page should give the name and matriculation number of both members of the pair. Please stick to the suggested file name for our convenience.
As in Part 1, we will normally allocate both people the same mark. To notify us of an alternative allocation you have agreed, please put a table on the first page of the report with a percentage split.
The deadline for Part 2 is Friday 21st November, 4pm.
This practical is worth 30% of your mark for the course; it should not take you significantly more than 30 hours per person to complete both parts. Design tasks can consume an arbitrary amount of time so you need to be disciplined and plan how much time to allocate. Time constraints will have a big impact on what you can implement, so familiarise yourself early on with the GUI framework to get an idea of the degree of difficulty for any proposed design. Start your design on paper, as you should have a clear idea of what you want before you begin coding. Make sure the process is iterative --- test (and perhaps evaluate) each part as you build it. Avoid getting obsessed with perfecting low-level aesthetics (e.g. choosing fonts or designing icons) at the expense of good interaction design and usability.
You may, if you wish, use an IDE with an interface builder, such as NetBeans or Eclipse. Note that Eclipse does not include an interface builder by default, although plugins are available for assisting with SWT/JFace or Swing/AWT UI design.
Here are some pointers for NetBeans (the version installed on DICE is NetBeans IDE 5.5):
Important note: If you use an IDE, you must ensure that you are able to make a standalone Java application which can be compiled and run outside the IDE similarly to the one we have provided, so that we can build and run your code easily. It must compile and run on a standard DICE machine. Check that you can do this before getting too far. If you are unsure of how to do it, or concerned it will take too much time, please use hand-crafted Swing code.