Announcing meSpeak.js 2.0
Text to Speech in JS, now even better.
I am happy to announce a long-planned update to meSpeak.js, an open source TTS engine for the Web in JavaScript. This version brings some major updates (and some minor discontinuities in the API as well). In a nutshell, meSpeak.js is the open source eSpeak program cross-compiled to JS using Emscripten (a minimal POSIX runtime to run LLVM compiler output in JS), running in the browser with an additional API glued on top. MeSpeak.js is based on speak.js, which was an early demo application for Emscripten, but differs somewhat in architecture and features (like access to the entirety of eSpeak’s options, facilities for exporting and/or buffering audio data, a built-in audio playback API, modular voice and language descriptions, etc.), and also in compatibility.
And here are the major features of the update:
- Concurrent Web Worker
First of all, meSpeak.js now features a modular architecture consisting of two parts, a front-end (“mespeak.js”) and an application core (“mespeak-core.js”), which is loaded automatically by the front-end. The application core contains the Emscripten port and basic communication facilities, and features a dual personality: if the browser supports Web Workers, the core runs as a worker concurrently in the background; otherwise, it is loaded as an instance running in the main thread (as before). This means the application will run concurrently in any modern web browser and will occupy the UI thread only for resolving API calls and managing audio playback, while we still maintain compatibility with older clients. (BTW, the minimal requirements are basic support for typed arrays and the capability to play back WAV files, either via the Web Audio API or an HTML audio element. This means basic HTML5 capabilities and pre-ES5 JS support, essentially anything from 2011 and newer.)
Update: Sadly, it turns out that mobile devices (iOS, etc.) will mute the playback of a sound triggered by a postMessage event from a worker. While concurrent processing may have been especially useful on mobile, I had to disable workers for those devices. (On the other hand, meSpeak.js now contains a simple mobile unlocker, playing a short, inaudible sound on the first “touchstart” event, if applicable.)
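Schematically, the loading logic of the front-end boils down to something like the following. This is a minimal sketch, not the actual mespeak.js source; the user-agent test merely stands in for the mobile detection mentioned in the update above, and the core is assumed to reside at “mespeak-core.js” next to the front-end.

// Simplified sketch of the dual-personality loading described above
// (not the actual mespeak.js source): run the core as a Web Worker where
// available, otherwise fall back to an in-thread instance.
var isMutedMobile = /iPad|iPhone|iPod/.test(navigator.userAgent); // crude stand-in for the iOS check
var core = null;

if (typeof Worker !== 'undefined' && !isMutedMobile) {
  // worker mode: the core runs concurrently in the background
  core = new Worker('mespeak-core.js');
  core.onmessage = function (event) {
    // resolve pending API callbacks and hand any audio data to the playback layer
    console.log('message from core:', event.data);
  };
} else {
  // classic mode: load the core into the main thread via a script element
  var script = document.createElement('script');
  script.src = 'mespeak-core.js';
  document.head.appendChild(script);
}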
- Smaller File Size
Thanks to a more aggressive compression of the Emscripten core, the file size is now dramatically reduced (all in all about 500K gzipped).
- Simpler Loading (no config anymore)
One of the first changes to speak.js was the ability to load modular configuration and voice/language definitions. However, only a few users would bear the hassle of compiling a custom configuration file. Therefore, the standard configuration is now included in the core application, and there is a new API call to load a custom configuration file, overriding any of the included standard definitions: “meSpeak.loadCustomConfig(<url> [, callback])”.
Moreover, voices may now be loaded by providing just the bare “voice-ID” (loading the respective JSON file from inside the “voices” directory of your meSpeak installation). Meaning, these two calls are synonymous:
meSpeak.loadVoice('voices/en/en-us.json', myCallbackFunction);
meSpeak.loadVoice('en/en-us', myCallbackFunction);
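A minimal usage sketch of the simplified loading (assuming the default installation layout; the exact callback arguments are an assumption here and may differ):

// Sketch: fetch a voice by its bare ID and speak once it is ready.
// (Assumes the loadVoice callback reports success as its first argument.)
meSpeak.loadVoice('en/en-us', function (success, message) {
  if (success) {
    meSpeak.speak('No configuration file required anymore.', { voice: 'en/en-us' });
  } else {
    console.error('Could not load voice:', message);
  }
});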
- Compatibility Warning: Voice Paths
However, as a small drawback, there is a change regarding load paths (this also applies to loading custom config files). This is mainly because workers load files relative to their own location, while normal scripts load them relative to the path of the embedding page. Since our core features a dual personality, we must pick one of the two, and loading files always relative to the installation directory is certainly the more convenient and preferable choice. So, if you are already running meSpeak.js, you may have to adjust your voice paths (or reduce them to the bare voice-ID).
- Audio Stream Data Now in a Callback
Since workers run asynchronously and we want to maintain the compatibility level (no async/await or promises), we cannot simply return the audio data when called with the “rawdata” option. Instead, you must now supply a callback function in order to receive the audio data in the format specified:
var jobId = meSpeak.speak('Hello world',
  {
    pitch: 60,
    variant: 'm6',
    rawdata: 'data-url'
  },
  function(success, id, data) {
    if (success) myFiles.push(data);
  }
);
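Streams cached this way can still be played back later via meSpeak.play(); a short sketch (assuming the call signature has remained as in previous versions, where the optional second argument is the relative volume):

// Later on: play a previously exported/cached stream again.
// (Assumes meSpeak.play() accepts any of the rawdata export formats, as before.)
meSpeak.play(myFiles[0]);        // play the first cached utterance
meSpeak.play(myFiles[0], 0.75);  // same, at 75% relative volume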
- Audio Filters and Stereo Panning
Finally, you may now add basic Web Audio filters (BiquadFilters as “lowpass”, “highpass”, “bandpass”, “lowshelf”, “highshelf”, “peaking”, “notch”, “allpass”, and DynamicsCompressors) via a simple API for post-processing the audio. While these filters apply on a global level, you may now also pan individual utterances (converting the audio from mono to stereo for this). Mind that this is achieved by means of the Web Audio API and that exported streams will still be in mono. However, you may also use panning when playing back cached streams via “meSpeak.play()”.
Another addition is “meSpeak.getAudioAnalyser()”, which returns a Web Audio AnalyserNode.
See the examples below.
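To illustrate what this post-processing amounts to in plain Web Audio terms, here is a conceptual sketch of such a node graph. This is not meSpeak’s own filter API, just the underlying Web Audio building blocks it relies on:

// Conceptual Web Audio sketch of a global filter plus per-utterance panning.
// (meSpeak wires up an equivalent node graph internally; this is not its filter API.)
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();

var filter = audioCtx.createBiquadFilter(); // "lowpass", "highpass", "peaking", ...
filter.type = 'lowpass';
filter.frequency.value = 3000;              // cut off everything above 3 kHz

var panner = audioCtx.createStereoPanner(); // -1 = hard left, +1 = hard right
panner.pan.value = -0.8;                    // 80% to the left

// A synthesized utterance (an AudioBufferSourceNode) would then be routed as:
// source -> panner -> filter -> audioCtx.destination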
A Stereo Panning Example
This is an example for reading a text with distributed voices. The dialog, a sample transcript of a session with Joseph Weizenbaum’s famous ELIZA natural language conversation program [1], is read by two distinct characters: a female, US-English speaking voice representing the user, positioned 80% to the left, and a male character using English Received Pronunciation (RP), who impersonates the virtual doctor, speaking from 80% to the right.
(Mind that there are severe limitations for audio playback on mobile, with workarounds resulting in limited performance and, most likely, memory constraints as well. So this example may not work properly on a mobile device.)
Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE?
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED.
It's true. I am unhappy.
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY?
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP?
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY.
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU?
My father.
YOUR FATHER?
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE?
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE?
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU?
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU?
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER?
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE?
(The capitalized lines are the machine responses.)
[1] Weizenbaum, Joseph: "ELIZA - A Computer Program for the Study of Natural Language Communication Between Man and Machine", in: Communications of the ACM, Volume 9, Issue 1 (January 1966), pp. 36-45.
Note: Curiously, this runs faster on a dated version of Safari (9.1.3) than on the current build of Firefox (69.0.1). Specifically, Firefox may introduce pauses when resuming the worker, which is hidden here by buffering the audio.
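For the curious, the reading above could be driven by a small sequencing loop like the following sketch. It rests on a few assumptions spelled out in the comments, in particular that per-utterance panning is exposed as a speak() option; the option name used here is hypothetical:

// Rough sketch of sequencing the dialog with two panned voices.
// Assumptions: both voices ("en/en-us", "en/en-rp") have been loaded, the speak
// callback fires once an utterance has finished playing, and per-utterance panning
// is controlled by a "pan" option (the option name here is hypothetical).
var dialog = [
  { text: "Men are all alike.", voice: 'en/en-us', pan: -0.8 },               // the user, left
  { text: "IN WHAT WAY?",       voice: 'en/en-rp', variant: 'm3', pan: 0.8 }  // the doctor, right
  // ... and so on for the rest of the transcript
];

function readNext(i) {
  if (i >= dialog.length) return;
  var line = dialog[i];
  meSpeak.speak(line.text, { voice: line.voice, variant: line.variant, pan: line.pan },
    function () { readNext(i + 1); } // continue with the next line when done
  );
}

readNext(0);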
Audio AnalyserNode Example
This is a simple demonstration of “meSpeak.getAudioAnalyser()”, which returns a Web Audio AnalyserNode. Here, we draw an oscilloscope display of the waveform generated by “meSpeak.speak()”.
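The drawing behind such an oscilloscope is plain AnalyserNode usage; here is a minimal sketch (assuming a canvas element with the id “scope” on the page):

// Minimal oscilloscope: sample the waveform from meSpeak's AnalyserNode
// and draw it onto a canvas on every animation frame.
var analyser = meSpeak.getAudioAnalyser();
analyser.fftSize = 2048;
var samples = new Uint8Array(analyser.frequencyBinCount);

var canvas = document.getElementById('scope');   // assumed <canvas id="scope">
var ctx2d = canvas.getContext('2d');

function draw() {
  requestAnimationFrame(draw);
  analyser.getByteTimeDomainData(samples);        // values 0..255, 128 = silence
  ctx2d.clearRect(0, 0, canvas.width, canvas.height);
  ctx2d.beginPath();
  for (var i = 0; i < samples.length; i++) {
    var x = i / samples.length * canvas.width;
    var y = samples[i] / 255 * canvas.height;
    if (i === 0) ctx2d.moveTo(x, y); else ctx2d.lineTo(x, y);
  }
  ctx2d.stroke();
}
draw();

meSpeak.speak('Watch the waveform.');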
Backstory (or Why It Took so Long)
So, why was this update so long planned for, or in other words, why did it take so long? As mentioned before, meSpeak.js is based on an early incarnation of speak.js (as of 2011), with some changes applied for enhanced compatibility even then. In the meantime, Emscripten has evolved quite rapidly and (sadly) eventually stopped compiling a working instance of the speak.js project. So I was essentially stuck with this dated instance, hand-tuned for compatibility as broad as possible. Then, eventually, there was yet another release of speak.js, using a worker, but now Emscripten would not preserve any loaded files across consecutive calls. (This was/is probably due to an orientation towards running video games and other emulations, as in JS-MESS.) So we were stuck with this old instance again.
However, while modern Emscripten compiles to WebAssembly, providing much improved runtime speeds, it also moves the goalposts for compatibility quite aggressively. Generally, it requires the respectively latest browsers to run (and there may even be exceptions to this). On the other hand, running an old *NIX application, which started on Acorn/RISC OS in 1995, does not require the latest in performance. Running it in the background may be good enough, if we can maintain the benefits of full access to eSpeak’s options, playback via the Web Audio API, and modular voice definitions. As a bonus, this will run on anything as “recent” as 2011. If meSpeak.js ran before, it will do so in version 2.0 as well. All it took was the effort of revisiting the script, separating the core from the front-end, and adding a few features I had long wished for. (And even a bit of additional hand-tuning of the dated Emscripten core, namely for overwriting existing files.)
So, here you go: meSpeak.js v.2.0.
Norbert Landsteiner,
Vienna, 2019-08-10
Discussion/comments on Hacker News: news.ycombinator.com/item?id=20661193. (Oops, front page.)