Cross Platform Text-to-Speech
January 9, 2017 Erik van Bilsen
This post is a small exercise in designing a cross-platform abstraction layer for platform-specific functionality. In particular, we present a small Delphi library that adds cross-platform text-to-speech to your app. It works on Windows, macOS, iOS and Android.
You can find the complete source code on GitHub as part of the GrijjyFoundation repository.
If you are only interested in the end result, then you can stick to the first part of this post and bail when we get to the implementation details.
Choosing a feature set
A common issue with abstracting platform differences is that you must decide on a feature set. A specific feature may be supported on one platform, but not on another. When it comes to text-to-speech, some platforms support choosing a voice, changing the pitch or speech rate, customizing pronunciation with markup in the text to speak, and so on. Other platforms may not support some of these features, or only support them in an incompatible way.
You can choose to go for the lowest common denominator approach and only expose those features that are supported on all platforms. Or you can choose to support more features which either do nothing on certain platforms or raise some kind of “not supported” exception. Or you can have a combination of both.
Also, there may be some features (or issues) on one platform that also affect the API for other platforms. For example, on all platforms except Android, you can start speaking immediately after you create the text-to-speech engine. On Android though, you have to wait until the engine has fully initialized in the background before you can use it. This means that we have to add some sort of notification to the engine to let clients know that the engine has initialized. On non-Android platforms, we just fire this event immediately after construction.
To keep the size of this post somewhat manageable, we only support the most basic text-to-speech features: speaking some text (using the default voice and settings) and stopping it. At the end of this post, you should be able to add other features yourself.
IgoTextToSpeech API
In an earlier blog post we presented different ways to abstract platform-specific API differences. For the text-to-speech library we use the “object interface” approach discussed in that post. The text-to-speech API is defined in an interface called IgoTextToSpeech (in the unit Grijjy.TextToSpeech):
type
  IgoTextToSpeech = interface
  ['{7797ED2A-0695-445A-BA84-495E280F86AB}']
    {$REGION 'Internal Declarations'}
    function _GetAvailable: Boolean;
    function _GetOnAvailable: TNotifyEvent;
    ..etc...
    {$ENDREGION 'Internal Declarations'}

    function Speak(const AText: String): Boolean;
    procedure Stop;
    function IsSpeaking: Boolean;

    property Available: Boolean read _GetAvailable;
    property OnAvailable: TNotifyEvent read _GetOnAvailable write _SetOnAvailable;
    property OnSpeechStarted: TNotifyEvent read _GetOnSpeechStarted write _SetOnSpeechStarted;
    property OnSpeechFinished: TNotifyEvent read _GetOnSpeechFinished write _SetOnSpeechFinished;
  end;
To speak some text, just call the Speak method and supply the text to speak. If the engine was already speaking some text, then the current speech will be terminated. This method is asynchronous and returns immediately while the text is spoken in the background. Once the engine actually starts to speak, it will fire the OnSpeechStarted event.
You can use Stop to stop speaking. Depending on the platform, this will stop speaking immediately or wait until the current word has finished. The OnSpeechFinished event will be fired when the speech has stopped, either because there is no more text to speak, or because you called Stop to terminate it.
Finally, you will note the Available property and OnAvailable event. As said above, on all platforms except Android, the speech engine will be available immediately. The Available property will be True and OnAvailable will be fired immediately after creating the engine.
On Android however, it takes time to initialize the engine and it may not be available immediately. You have to wait until the engine has fully initialized (by checking the Available property), or you can use the OnAvailable event to get notified when this happens.
In all cases, it is probably easiest and safest to always use the OnAvailable event (on all platforms) and not speak any text until this event has fired.
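Putting the pieces together, a minimal usage sketch could look like this (the form, field and handler names are ours, for illustration only; creating the engine is covered in more detail below):

uses
  System.Classes, FMX.Forms, Grijjy.TextToSpeech;

type
  TFormMain = class(TForm)
  private
    FTextToSpeech: IgoTextToSpeech;
    procedure TextToSpeechAvailable(Sender: TObject);
  public
    constructor Create(AOwner: TComponent); override;
  end;

constructor TFormMain.Create(AOwner: TComponent);
begin
  inherited;
  { Create the platform-specific engine and wait until it reports itself
    available before speaking anything. }
  FTextToSpeech := TgoTextToSpeech.Create;
  FTextToSpeech.OnAvailable := TextToSpeechAvailable;
end;

procedure TFormMain.TextToSpeechAvailable(Sender: TObject);
begin
  { Fires right away on Windows, macOS and iOS; on Android it fires once the
    engine has finished initializing in the background. }
  FTextToSpeech.Speak('Hello from Delphi');
end;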
You may have noticed that I started the names of all property/event getter and setter methods with an underscore (like _GetAvailable). This is to persuade the user of the interface to use the property/event instead of calling the method directly. These methods won’t even show up in Code Insight if you prefix them with an underscore (although you can still call them if you insist).
You create an object that implements this interface by simply calling:
TextToSpeech := TgoTextToSpeech.Create;
This uses the static class function approach to create the platform-specific instance, as discussed in the earlier post mentioned above. All in all not too complicated. Try out the sample application in the repository to see how it works on all platforms.
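To illustrate the idea behind that approach: the factory is just a class with a single class function that returns the interface, picking the implementation class at compile time. A sketch (only Grijjy.TextToSpeech.Windows is mentioned later in this post; the other unit names in the comments are assumptions):

type
  { Callers only ever see this class function and the IgoTextToSpeech
    interface, never the platform-specific classes. }
  TgoTextToSpeech = class
  public
    class function Create: IgoTextToSpeech; static;
  end;

class function TgoTextToSpeech.Create: IgoTextToSpeech;
begin
  { The implementation uses clause pulls in exactly one platform-specific
    unit, each declaring its own TgoTextToSpeechImplementation class.
    (IOS must be tested before MACOS, because iOS also defines MACOS.) }
  {$IF Defined(MSWINDOWS)}
  Result := TgoTextToSpeechImplementation.Create; // Grijjy.TextToSpeech.Windows
  {$ELSEIF Defined(IOS)}
  Result := TgoTextToSpeechImplementation.Create; // e.g. Grijjy.TextToSpeech.iOS
  {$ELSEIF Defined(MACOS)}
  Result := TgoTextToSpeechImplementation.Create; // e.g. Grijjy.TextToSpeech.Mac
  {$ELSEIF Defined(ANDROID)}
  Result := TgoTextToSpeechImplementation.Create; // e.g. Grijjy.TextToSpeech.Android
  {$ENDIF}
end;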
Super Simple Text-to-Speech
The remainder of this post discusses how text-to-speech is implemented on the various platforms. Feel free to skip this if you don’t care about the details. However, if you are new to COM or using (and defining) Java classes and Objective-C classes, and want to learn a bit about it, then stick around.
TgoTextToSpeechBase class
A base implementation of the IgoTextToSpeech interface is provided in the abstract TgoTextToSpeechBase class. This class provides the fields for the Available property and the various events, as well as helper methods to fire the events from the main thread (so you can update the UI from those events if you want to).
The actual API methods (Speak, Stop and IsSpeaking) are all virtual and abstract and overridden by the platform-specific derived classes.
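As an example of those event helpers, firing OnSpeechStarted could be done with TThread.Queue (a sketch only; it assumes a FOnSpeechStarted: TNotifyEvent field in the base class, and the real class may dispatch the events differently):

procedure TgoTextToSpeechBase.DoSpeechStarted;
begin
  { The platform may notify us on a background thread, so queue the handler
    on the main thread, where it can safely update the UI. }
  if Assigned(FOnSpeechStarted) then
    TThread.Queue(nil,
      procedure
      begin
        FOnSpeechStarted(Self);
      end);
end;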
Text-to-Speech on Windows
On Windows, we use the Speech API to provide text-to-speech. This API is built using the Component Object Model (COM). Unfortunately, Delphi does not provide the translations of the Speech API header files. However, it is pretty easy to import them as a type library:
- In Delphi, pick the “Component | Import Component…” menu option.
- Select “Import a Type Library”.
- The next page shows all registered type libraries. There should be a “Microsoft Speech Object Library” in there.
- You can finish the wizard and create an import unit.
Import the Speech API Type Library
Unfortunately the type library importer imports some declarations incorrectly, which can be especially problematic when used inside a 64-bit Windows app. So instead, I extracted the declarations we care about from the imported type library, fixed them and put them at the top of the Grijjy.TextToSpeech.Windows unit.
The main COM object for text-to-speech is exposed through the ISpVoice interface. You create this COM object with the following code:
FVoice := CreateComObject(CLASS_SpVoice) as ISpVoice;
Then we can speak some text using its Speak method:

function TgoTextToSpeechImplementation.Speak(const AText: String): Boolean;
begin
  if (FVoice = nil) then
    Result := False
  else
    Result := (FVoice.Speak(PWideChar(AText), SPF_ASYNC, nil) = S_OK);
end;
We pass the SPF_ASYNC flag so the method returns immediately and speaks the text in the background.
If we want to get notified when the system has finished speaking, then we need to subscribe to an event. We need to let the engine know which events we are interested in (through the SetInterest method) and how we should get notified (by calling the SetNotifyCallbackFunction method):
constructor TgoTextToSpeechImplementation.Create;
var
  Events: ULONGLONG;
begin
  inherited Create;
  FVoice := CreateComObject(CLASS_SpVoice) as ISpVoice;
  if (FVoice <> nil) then
  begin
    Events := SPFEI(SPEI_START_INPUT_STREAM) or SPFEI(SPEI_END_INPUT_STREAM);
    OleCheck(FVoice.SetInterest(Events, Events));
    OleCheck(FVoice.SetNotifyCallbackFunction(VoiceCallback, 0, NativeInt(Self)));
  end;
end;
Note that the SetInterest and SetNotifyCallbackFunction methods are not defined in ISpVoice itself. They are declared in parent interfaces of ISpVoice, called ISpEventSource and ISpNotifySource respectively.
Here we say we are interested in when the system starts and stops speaking (see the Events variable). We pass this variable twice to the SetInterest method. The first one is to tell the system what events we are interested in. The second one (which must be the same as or a subset of the first one) tells the system what events should be queued in the event queue. We need to enable this queuing because it is the only way to know what event has been fired.
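SPFEI is a macro in the C headers, so the Delphi unit needs its own version of it. A sketch of one possible translation (ULONGLONG comes from Winapi.Windows; SPEI_RESERVED1 and SPEI_RESERVED2 are assumed to be among the imported SPEVENTENUM constants, and SAPI uses those two reserved bits to validate the flags value):

function SPFEI(const AEventId: Cardinal): ULONGLONG; inline;
begin
  { Turn an SPEVENTENUM ordinal into the bit flag expected by SetInterest.
    The two reserved bits act as a sanity check on the SAPI side. }
  Result := (UInt64(1) shl AEventId)
         or (UInt64(1) shl SPEI_RESERVED1)
         or (UInt64(1) shl SPEI_RESERVED2);
end;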
The actual notification occurs by calling the function that is passed to the SetNotifyCallbackFunction method. This must be a stdcall function with the following signature:
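In the SAPI headers this callback type is named SPNOTIFYCALLBACK. A Delphi equivalent could be declared like this (the Tgo name is ours; WPARAM and LPARAM come from Winapi.Windows):

type
  { Matches SPNOTIFYCALLBACK from the SAPI headers. }
  TgoSpNotifyCallback = procedure(wParam: WPARAM; lParam: LPARAM); stdcall;

Our VoiceCallback class method has exactly this shape: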
class procedure TgoTextToSpeechImplementation.VoiceCallback(
  wParam: WPARAM; lParam: LPARAM);
begin
  TgoTextToSpeechImplementation(lParam).HandleVoiceEvent;
end;
Note that there are other ways to be notified than using a callback function. For example, you can also have the engine send a Window Message, or you can create a “notification sink” by implementing the ISpNotifySink interface.
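To sketch the window-message route (the method name, window handle parameter and message id below are ours, for illustration only):

procedure TgoTextToSpeechImplementation.UseWindowMessageNotifications(
  const AWindowHandle: HWND);
const
  WM_VOICE_EVENT = WM_USER + 1; // any application-defined message id
begin
  { Ask SAPI to post WM_VOICE_EVENT to the given window whenever one of the
    requested events fires, instead of invoking a callback function. }
  OleCheck(FVoice.SetNotifyWindowMessage(AWindowHandle, WM_VOICE_EVENT, 0, 0));
end;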
You can declare the callback as a global function, or you can make it a static class method. We choose the second approach here to keep things organized.
The callback receives two parameters, which are the wParam and lParam values we passed to the SetNotifyCallbackFunction API. We passed Self as the lParam value there, so that we can access our object from the callback (which, being a static class method, has no implicit Self). Here, the callback just forwards the notification to the HandleVoiceEvent method of our class:
procedure TgoTextToSpeechImplementation.HandleVoiceEvent;
var
  Event: SPEVENT;
  NumEvents: ULONG;
begin
  if (FVoice = nil) then
    Exit;

  { Handle all events in the event queue. Before calling GetEvents, the
    Event record should be cleared. }
  FillChar(Event, SizeOf(Event), 0);
  while (FVoice.GetEvents(1, @Event, NumEvents) = S_OK) do
  begin
    case Event.eEventId of
      SPEI_START_INPUT_STREAM:
        DoSpeechStarted;

      SPEI_END_INPUT_STREAM:
        DoSpeechFinished;
    end;

    FillChar(Event, SizeOf(Event), 0);
  end;
end;
This method just processes the event queue for start and finish notifications and fires the OnSpeechStarted and OnSpeechFinished events accordingly.
That covers the most important concepts on the Windows side.