Do you want to use OpenAI to make your cross-platform apps generate realistic speech from text? Read all about how to do this in this article from Tech Partner Softacom.
Overview Of How To Use OpenAI To Add Realistic Text-To-Speech To Your Apps
Modern artificial intelligence-based services enable speech generation and speech-to-text conversion. Moreover, they support a wide range of languages. We can easily input text into the service and receive synthesized speech output. Thanks to the available settings, we can also choose the type of voice for the generated speech.
Additionally, it’s possible to convert speech into text. For example, we can transcribe the lyrics of our favorite artists’ MP3 tracks into text.
In this article, we will explore the capabilities of the OpenAI API for generating speech from textual descriptions and, conversely, for extracting text from speech, in our Embarcadero Delphi FMX application.
To use the OpenAI API features for speech generation and text transcription, we need to register and obtain a secret key (API key). We demonstrated the registration process and how to obtain the key in our earlier article dedicated to text generation.
To generate speech based on user requests, we will utilize the OpenAI API (the “Create speech” tab).
The OpenAI API offers extensive functionality for speech generation. Here we can configure the voice type, the output media file format (mp3, wav, and so on), the speech speed in the generated media file, the textual description for the generated speech, and the machine learning model to be used (tts-1 or tts-1-hd).
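As an illustration, the request body for the speech endpoint is a small JSON document. Below is a minimal sketch of assembling it with TJSONObject; the parameter names come from the OpenAI API reference, while the concrete values are example choices:

var
  JObj: TJSONObject; // from the System.JSON unit
begin
  JObj := TJSONObject.Create;
  try
    JObj.AddPair('model', 'tts-1');                  // or 'tts-1-hd' for higher quality
    JObj.AddPair('input', 'Hello from Delphi!');     // the text to be spoken
    JObj.AddPair('voice', 'alloy');                  // alloy, echo, fable, onyx, nova, shimmer
    JObj.AddPair('response_format', 'mp3');          // output format: mp3, wav, and others
    JObj.AddPair('speed', TJSONNumber.Create(1.0));  // speech speed, from 0.25 to 4.0
    // JObj.ToJSON now holds the body we will POST to the endpoint
  finally
    JObj.Free;
  end;
end;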
To extract text from speech, we will also use the OpenAI API (the “Create transcription” tab).
The OpenAI API also has rich capabilities for generating text based on speech.
Here, we can configure the type of input media file (mp3, wav, etc.) and the format of the response from OpenAI (json, text, srt, verbose_json, or vtt).
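For reference, the two endpoint URLs used in this article are taken from the OpenAI API documentation and can be kept as constants:

const
  // Endpoint for generating speech from text (“Create speech”)
  URL_SPEECH = 'https://api.openai.com/v1/audio/speech';
  // Endpoint for extracting text from speech (“Create transcription”)
  URL_TRANSCRIPTION = 'https://api.openai.com/v1/audio/transcriptions';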
To make the OpenAI API more convenient to use for generating speech from user requests and for extracting text from speech, we will extend the TChatGPT class developed in our earlier article on text generation.
Let’s add two methods to our class: GetGeneratedSpeechAsStream and GetGeneratedTextFromSpeech. We will also add an overloaded constructor (Create) to support the extraction of text from speech in a media file.
This overloaded version of the constructor accepts the following input parameters: HttpBasicAuthenticator (the THTTPBasicAuthenticator class), RESTClient (the TRESTClient class), RESTRequest (the TRESTRequest class), and a string constant OpenAIApiKey containing our secret key.
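A condensed sketch of the extended class declaration is shown below, assuming an IChatGPTHelper interface that exposes the two methods (the field and parameter names follow the ones used in this article; the full source is in the repository linked at the end):

type
  TChatGPT = class(TInterfacedObject, IChatGPTHelper)
  private
    FNetHttpClient: TNetHTTPClient;                   // used for speech generation
    FHTTPBasicAuthenticator: THTTPBasicAuthenticator; // used for transcription
    FRESTClient: TRESTClient;
    FRESTRequest: TRESTRequest;
    FOpenAIApiKey: string;
  public
    // Constructor used for speech generation via TNetHTTPClient
    constructor Create(NetHttpClient: TNetHTTPClient;
      const OpenAIApiKey: string); overload;
    // Overloaded constructor used for extracting text from a media file
    constructor Create(HttpBasicAuthenticator: THTTPBasicAuthenticator;
      RESTClient: TRESTClient; RESTRequest: TRESTRequest;
      const OpenAIApiKey: string); overload;
    // Returns the generated speech as a stream (the caller frees it)
    function GetGeneratedSpeechAsStream(const Input, Voice: string): TMemoryStream;
    // Returns the text extracted from the speech in the given media file
    function GetGeneratedTextFromSpeech(const InputAudioFilePath: string): string;
  end;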
The GetGeneratedSpeechAsStream method of the TChatGPT class allows us to obtain a TMemoryStream object containing the speech generated from a textual description using the OpenAI API.
Its input parameters are the string constants Input and Voice, representing our textual description and the type of generated voice (alloy, echo, fable, onyx, nova, or shimmer).
Details about the machine learning model used (in our case, tts-1), the textual description for speech generation (input), and the voice type (voice) are contained in JObj, an object of the TJSONObject class.
The string variable Request stores the data from the JObj object as a string. The content of Request is then passed into a StringStream object (the TStringStream class).
Next, the string data from StringStream is transferred to a MultipartFormData object (the TMultipartFormData class).
The URL of the OpenAI API for speech generation is passed to the Post method of the FNetHttpClient object as an input parameter, along with the MultipartFormData object containing our model data, the textual description for speech generation, and the voice type.
Similar to text and image generation projects, we also need to include headers (Authorization and Content-Type). Upon executing the FNetHttpClient.Post method, we will obtain the generated speech from OpenAI in the form of a TMemoryStream.
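Reconstructed from the description above, here is a simplified sketch of the method. For brevity it posts the JSON body directly as a stream instead of routing it through a TMultipartFormData object; the complete method is in the repository linked at the end of this article:

// Requires System.Classes, System.JSON, System.Net.URLClient,
// and System.Net.HttpClientComponent in the uses clause.
function TChatGPT.GetGeneratedSpeechAsStream(const Input, Voice: string): TMemoryStream;
var
  JObj: TJSONObject;
  StringStream: TStringStream;
begin
  Result := TMemoryStream.Create;
  JObj := TJSONObject.Create;
  try
    JObj.AddPair('model', 'tts-1');
    JObj.AddPair('input', Input);
    JObj.AddPair('voice', Voice);
    StringStream := TStringStream.Create(JObj.ToJSON, TEncoding.UTF8);
    try
      // The Authorization and Content-Type headers required by the endpoint
      FNetHttpClient.Post('https://api.openai.com/v1/audio/speech',
        StringStream, Result,
        [TNameValuePair.Create('Authorization', 'Bearer ' + FOpenAIApiKey),
         TNameValuePair.Create('Content-Type', 'application/json')]);
      Result.Position := 0; // the generated audio bytes are now in the stream
    finally
      StringStream.Free;
    end;
  finally
    JObj.Free;
  end;
end;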
The GetGeneratedTextFromSpeech method will allow us to convert speech into text. The method takes a string constant InputAudioFilePath as input, which contains the path to the media file. The BaseURL property of the FRESTClient object contains the URL of the OpenAI API for generating text based on speech from our media file. The FRESTRequest object contains information about the type of response from OpenAI (text, json, srt, verbose_json, or vtt), the machine learning model used (in our case, whisper-1), and the path to our media file with recorded speech.
Authentication will be performed using the FHTTPBasicAuthenticator object (the THTTPBasicAuthenticator class). We need to assign our secret key to the Password field (FHTTPBasicAuthenticator.Password := FOpenAIApiKey).
The FRESTRequest.Execute method will perform a POST request, passing the media file so that OpenAI can extract the text from it. As a result, we will receive a string with the transcribed text (Result := FRESTRequest.Response.Content).
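Here is a condensed sketch of the method, reconstructed from the description above (the complete version is in the repository linked at the end of this article):

// Requires REST.Types, REST.Client, and REST.Authenticator.Basic in the uses clause.
function TChatGPT.GetGeneratedTextFromSpeech(const InputAudioFilePath: string): string;
begin
  // The secret key authenticates the request
  FHTTPBasicAuthenticator.Password := FOpenAIApiKey;
  FRESTClient.BaseURL := 'https://api.openai.com/v1/audio/transcriptions';
  FRESTRequest.Method := TRESTRequestMethod.rmPOST;
  FRESTRequest.AddParameter('model', 'whisper-1');      // the speech-recognition model
  FRESTRequest.AddParameter('response_format', 'text'); // text, json, srt, verbose_json, vtt
  FRESTRequest.AddFile('file', InputAudioFilePath);     // the media file with recorded speech
  FRESTRequest.Execute;
  Result := FRESTRequest.Response.Content;
end;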
The full source code of the TChatGPT class is available in the example repository linked at the end of this article.
Implementation Of Speech Generation Based On Textual Description And Text Extraction From Speech In A Media File In Our Embarcadero Delphi FMX Application
In our Delphi FMX application, we will use the TNetHttpClient component to work with the OpenAI API, specifically for sending POST requests to OpenAI.
To play the speech generated by OpenAI and saved in a media file (in MP3 format) in our Embarcadero Delphi FMX application, we will use the TMediaPlayer component.
To make a request to OpenAI with the transfer of a saved media file for extracting text from speech within it, we will use three components: TRESTClient, TRESTRequest, and THTTPBasicAuthenticator.
No additional setup is required for these components. TRESTClient and TRESTRequest are used to make POST requests and retrieve data from OpenAI with the extracted speech text from our media file. THTTPBasicAuthenticator is used for authentication using the secret key.
To input textual descriptions for speech generation, we will use the TMemo component.
We will also use the TMemo component to display the extracted text from the speech in the media file.
In the OnCreate event handler of the main form, we need to assign to the FAudioFilePath field the path to the media file where the speech generated by OpenAI will be saved. We will also assign the value of the secret key to the FOpenAIApiKey field.
The functionality of speech generation from a textual description, with saving to a media file and playback in our Embarcadero Delphi FMX application, is implemented in the OnClick handler of the “Send Request For Speech Generation” button. In this handler, we declare GPTHelper (of the IChatGPTHelper type) to pass the textual description to OpenAI for speech generation, and ImageStream (the TMemoryStream class) to store the generated speech as a TMemoryStream.
Next, we call the constructor of the TChatGPT class, passing NetHttpClient1 and our secret key (FOpenAIApiKey). Then we invoke the GetGeneratedSpeechAsStream method, providing the textual description of the generated speech (Memo2.Text) and the voice type (the string ‘onyx’ in our example) as parameters. To prevent blocking the application interface while the request executes, we use TTask.Run. The result of executing the GetGeneratedSpeechAsStream method, namely the generated speech, is saved into ImageStream.
In the main application thread, using TThread.Synchronize, we save our speech to an MP3 media file using the ImageStream.SaveToFile method. Before saving, we check whether a file already exists at the specified path using the FileExists function and, if so, delete it with the DeleteFile function. After saving, we play the media file in our Embarcadero Delphi FMX application using TMediaPlayer (MediaPlayer1.Play), providing the path to our media file in MediaPlayer1.FileName.
The code for the “Send Request For Speech Generation” button handler is provided below.
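What follows is a condensed sketch reconstructed from the steps above; the handler name (btnGenerateSpeechClick) is illustrative, and the complete handler is in the repository linked at the end of this article:

// Requires System.SysUtils, System.Classes, System.Threading, and FMX.Media.
procedure TForm1.btnGenerateSpeechClick(Sender: TObject);
var
  GPTHelper: IChatGPTHelper;
  ImageStream: TMemoryStream;
begin
  GPTHelper := TChatGPT.Create(NetHttpClient1, FOpenAIApiKey);
  // Run the request in a background task so the UI stays responsive
  TTask.Run(
    procedure
    begin
      ImageStream := GPTHelper.GetGeneratedSpeechAsStream(Memo2.Text, 'onyx');
      TThread.Synchronize(nil,
        procedure
        begin
          try
            // Replace any previously saved media file
            if FileExists(FAudioFilePath) then
              DeleteFile(FAudioFilePath);
            ImageStream.SaveToFile(FAudioFilePath);
            // Play the freshly saved MP3
            MediaPlayer1.FileName := FAudioFilePath;
            MediaPlayer1.Play;
          finally
            ImageStream.Free;
          end;
        end);
    end);
end;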
Now let’s extract text from the speech in our saved media file. We will implement this functionality in the OnClick handler of the “Speech From Audio File To Text” button. In the handler, we declare GPTHelper (of the IChatGPTHelper type) to pass our media file to OpenAI for text extraction. We also declare a string variable Text, where we will store the text extracted from the media file.
Next, we should call the second variant of the constructor with four input parameters (HTTPBasicAuthenticator1, RESTClient1, RESTRequest1, FOpenAIApiKey). Then, we will invoke the GetGeneratedTextFromSpeech method, passing the path to our media file. This method will return the extracted text from the speech in the media file. Finally, we will display the received text using TMemo (Memo1.Text).
The code for the “Speech From Audio File To Text” button handler is provided below.
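What follows is a condensed sketch reconstructed from the steps above; the handler name (btnSpeechToTextClick) is illustrative, and the complete handler is in the repository linked at the end of this article:

procedure TForm1.btnSpeechToTextClick(Sender: TObject);
var
  GPTHelper: IChatGPTHelper;
  Text: string;
begin
  // The four-parameter constructor wires up the REST components
  GPTHelper := TChatGPT.Create(HTTPBasicAuthenticator1, RESTClient1,
    RESTRequest1, FOpenAIApiKey);
  // Send the saved media file to OpenAI and receive the transcribed text
  Text := GPTHelper.GetGeneratedTextFromSpeech(FAudioFilePath);
  Memo1.Text := Text;
end;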
You also need to add the following code to the FormCreate event handler, replacing the key string with your own OpenAI API key:
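Here is a minimal sketch; the file name speech.mp3 is an illustrative choice, and the key value is a placeholder:

// Requires System.IOUtils in the uses clause.
procedure TForm1.FormCreate(Sender: TObject);
begin
  // Path to the media file in the Documents directory (file name assumed)
  FAudioFilePath := TPath.Combine(TPath.GetDocumentsPath, 'speech.mp3');
  // Placeholder: replace with your own OpenAI secret key
  FOpenAIApiKey := 'sk-...';
end;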
Let’s test our Embarcadero Delphi FMX application. First, based on a textual description, we will generate speech. The speech will be played using TMediaPlayer and saved to a media file with an mp3 extension.
In our Embarcadero Delphi FMX application, the media file will be saved in the “Documents” directory.
Now, using our application, let’s convert our speech saved in the media file back into text.
Where can I download the example code?
The code is in this repository: https://github.com/Embarcadero/OpenAI_Audio_Demo
Do you want to try some of these examples for yourself? Why not download a free trial of the latest version of RAD Studio with Delphi?
This article was written by Embarcadero Tech Partner Softacom. Softacom specializes in all sorts of software development focused on Delphi. Read more about their services on the Softacom website.