Do you want to use OpenAI to make your cross-platform apps generate realistic speech from text? Read all about how to do this in this article from Tech Partner Softacom.
Overview Of How To Use OpenAI To Add Realistic Text-To-Speech To Your Apps
Modern artificial intelligence-based services enable speech generation and speech-to-text conversion. Moreover, they support a wide range of languages. We can easily input text into the service and receive synthesized speech output. Thanks to the available settings, we can also choose the type of voice for the generated speech.
Additionally, it’s possible to convert speech into text. For example, we can transcribe songs from our favorite artists’ MP3 tracks into text.
In this article, we will look at the capabilities of the OpenAI API for generating speech from a textual description and, conversely, generating text from speech, in our Embarcadero Delphi FMX application.
To use the features of the OpenAI API for speech generation and text transcription, we need to register and obtain a secret key. In the article dedicated to text generation, we demonstrated the process of registration and obtaining a secret key (API key).
To generate speech based on user requests, we will utilize the OpenAI API (the “Create speech” tab).
The OpenAI API offers extensive functionality for speech generation. Here we can configure the voice type, output media file format (such as mp3, wav, etc.), speech speed in the generated media file, textual description for the generated speech, and the machine learning model that will be used (tts-1 or tts-1-hd).
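For reference, the JSON body such a request carries is small. A minimal example is sketched below; the `response_format` and `speed` fields are optional, and the values shown are purely illustrative:

```json
{
  "model": "tts-1",
  "input": "Hello from Delphi!",
  "voice": "onyx",
  "response_format": "mp3",
  "speed": 1.0
}
```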
To extract text from speech, we will also use the OpenAI API (the “Create transcription” tab).
The OpenAI API also has rich capabilities for generating text based on speech.
Here, we can configure the type of input media file (mp3, wav, etc.) and the format of the response from OpenAI (json, text, srt, verbose_json, or vtt).
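The shape of the reply depends on the chosen response format. With `response_format` set to `text`, OpenAI returns the bare transcription string; with `json`, it returns a small object along these lines (the `text` value is illustrative):

```json
{
  "text": "Hello from Delphi!"
}
```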
To make the OpenAI API easier to use for speech generation and transcription, we will extend the TChatGPT class developed in the earlier published article focused on text generation.
Let’s add two methods to our class: GetGeneratedSpeechAsStream and GetGeneratedTextFromSpeech. We will also add an overloaded Create constructor to support extracting text from speech in a media file.
This overloaded version of the TChatGPT constructor accepts the following input parameters: HttpBasicAuthenticator (the THTTPBasicAuthenticator class), RESTClient (the TRESTClient class), RESTRequest (the TRESTRequest class), and a string constant OpenAIApiKey with our secret key.
The GetGeneratedSpeechAsStream method allows us to obtain a TMemoryStream object containing speech generated from a textual description via the OpenAI API.
Its input parameters are the string constants Input and Voice, representing our textual description and the type of generated voice (alloy, echo, fable, onyx, nova, or shimmer).
Details about the machine learning model that is used (in our case, tts-1), as well as the textual description for speech generation (input) and voice type (voice), are contained in JObj, an object of the TJSONObject class.
The string variable Request stores the data from the JObj object as a string. Further, the content of Request in string format will be passed into a StringStream object (the TStringStream class).
Next, the URL of the OpenAI API speech endpoint is passed to the Post method of the FNetHttpClient object, together with StringStream as the request body carrying the model name, the textual description for speech generation, and the voice type.
Similar to text and image generation projects, we also need to include headers (Authorization and Content-Type). Upon executing the FNetHttpClient.Post method, we will obtain the generated speech from OpenAI in the form of a TMemoryStream.
The GetGeneratedTextFromSpeech method will allow us to convert speech into text. The method takes a string constant InputAudioFilePath as input, which contains the path to the media file. The BaseURL property of the FRESTClient object contains the URL of the OpenAI API for generating text based on speech from our media file. The FRESTRequest object contains information about the type of response from OpenAI (text, json, srt, verbose_json, or vtt), the machine learning model used (in our case, whisper-1), and the path to our media file with recorded speech.
Authentication will be performed using the FHTTPBasicAuthenticator object (the THTTPBasicAuthenticator class). We need to assign our secret key to its Password property (FHTTPBasicAuthenticator.Password := FOpenAIApiKey).
The FRESTRequest.Execute method will perform a POST request, passing the media file so that OpenAI can extract the text from it. As a result, we will receive a string with the converted speech text (Result := FRESTRequest.Response.Content).
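Putting the pieces together, what the REST components send is, in effect, a multipart/form-data POST along the following lines. This is only a sketch: the boundary, filename, and exact header layout are illustrative, and the Authorization header is produced by the configured authenticator:

```http
POST /v1/audio/transcriptions HTTP/1.1
Host: api.openai.com
Authorization: <built by THTTPBasicAuthenticator from the secret key>
Content-Type: multipart/form-data; boundary=----Boundary123

------Boundary123
Content-Disposition: form-data; name="model"

whisper-1
------Boundary123
Content-Disposition: form-data; name="response_format"

text
------Boundary123
Content-Disposition: form-data; name="file"; filename="GeneratedVoice.mp3"
Content-Type: application/octet-stream

<binary MP3 data>
------Boundary123--
```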
The full source code of the TChatGPT class is presented below.
```delphi
unit ChatGPTHelper;

interface

uses
  System.SysUtils, System.Types, System.UITypes, System.Classes, System.Variants,
  FMX.Types, FMX.Controls, FMX.Forms, FMX.Graphics, FMX.Dialogs, FMX.Memo.Types,
  FMX.ScrollBox, FMX.Memo, FMX.StdCtrls, FMX.Controls.Presentation,
  System.Net.URLClient, System.Net.HttpClient, System.Net.HttpClientComponent,
  JSON, System.Threading, System.Net.Mime, System.Generics.Collections,
  REST.Client, REST.Types, REST.Authenticator.Basic;

type
  IChatGPTHelper = interface
    function SendTextToChatGPT(const Text: string): string;
    function GetJSONWithImage(const Prompt: string; ResponseFormat: Integer): string;
    function GetImageURLFromJSON(const JsonResponse: string): string;
    function GetImageAsStream(const ImageURL: string): TMemoryStream;
    function GetImageBASE64FromJSON(const JsonResponse: string): string;
    function GetGeneratedSpeechAsStream(const Input: string; const Voice: string): TMemoryStream;
    function GetGeneratedTextFromSpeech(const InputAudioFilePath: string): string;
  end;

  TChatGPT = class(TInterfacedObject, IChatGPTHelper)
  private
    FNetHttpClient: TNetHTTPClient;
    FHttpBasicAuthenticator: THTTPBasicAuthenticator;
    FRestRequest: TRESTRequest;
    FRestClient: TRESTClient;
    FOpenAIApiKey: string;
    FText: string;
    function FormatJSON(const JSON: string): string;
    function SendTextToChatGPT(const Text: string): string;
    function GetJSONWithImage(const Prompt: string; ResponseFormat: Integer): string;
    function GetImageURLFromJSON(const JsonResponse: string): string;
    function GetImageAsStream(const ImageURL: string): TMemoryStream;
    function GetImageBASE64FromJSON(const JsonResponse: string): string;
    function GetGeneratedSpeechAsStream(const Input: string; const Voice: string): TMemoryStream;
    function GetGeneratedTextFromSpeech(const InputAudioFilePath: string): string;
  public
    constructor Create(const NetHttpClient: TNetHTTPClient;
      const OpenAIApiKey: string); overload;
    constructor Create(const HttpBasicAuthentificator: THTTPBasicAuthenticator;
      const RESTClient: TRESTClient; const RESTRequest: TRESTRequest;
      const OpenAIApiKey: string); overload;
    class function MessageContentFromChatGPT(const JsonAnswer: string): string;
  end;

implementation

{ TChatGPT }

constructor TChatGPT.Create(const NetHttpClient: TNetHTTPClient;
  const OpenAIApiKey: string);
begin
  FNetHttpClient := NetHttpClient;
  if OpenAIApiKey <> '' then
    FOpenAIApiKey := OpenAIApiKey
  else
  begin
    ShowMessage('OpenAI API key is empty!');
    Exit;
  end;
end;

constructor TChatGPT.Create(const HttpBasicAuthentificator: THTTPBasicAuthenticator;
  const RESTClient: TRESTClient; const RESTRequest: TRESTRequest;
  const OpenAIApiKey: string);
begin
  FHttpBasicAuthenticator := HttpBasicAuthentificator;
  FRestRequest := RESTRequest;
  FRestClient := RESTClient;
  if OpenAIApiKey <> '' then
    FOpenAIApiKey := OpenAIApiKey
  else
  begin
    ShowMessage('OpenAI API key is empty!');
    Exit;
  end;
end;

function TChatGPT.FormatJSON(const JSON: string): string;
var
  JsonObject: TJsonObject;
begin
  JsonObject := TJsonObject.ParseJSONValue(JSON) as TJsonObject;
  try
    if Assigned(JsonObject) then
      Result := JsonObject.Format()
    else
      Result := JSON;
  finally
    JsonObject.Free;
  end;
end;

function TChatGPT.GetGeneratedSpeechAsStream(const Input, Voice: string): TMemoryStream;
var
  JObj: TJsonObject;
  Request: string;
  MyHeaders: TArray<TNameValuePair>;
  StringStream: TStringStream;
begin
  JObj := nil;
  StringStream := nil;
  try
    Result := TMemoryStream.Create;
    SetLength(MyHeaders, 2);
    // FOpenAIApiKey is expected to arrive here already prefixed with 'Bearer '
    MyHeaders[0] := TNameValuePair.Create('Authorization', FOpenAIApiKey);
    MyHeaders[1] := TNameValuePair.Create('Content-Type', 'application/json');
    JObj := TJSONObject.Create;
    JObj.AddPair('model', 'tts-1');
    JObj.AddPair('input', Input);
    JObj.AddPair('voice', Voice);
    Request := JObj.ToString;
    StringStream := TStringStream.Create(Request, TEncoding.UTF8);
    FNetHttpClient.Post('https://api.openai.com/v1/audio/speech', StringStream,
      Result, MyHeaders);
  finally
    JObj.Free;
    StringStream.Free;
  end;
end;

function TChatGPT.GetGeneratedTextFromSpeech(const InputAudioFilePath: string): string;
begin
  FRestClient.Authenticator := FHttpBasicAuthenticator;
  FRestRequest.Method := TRESTRequestMethod.rmPOST;
  FHttpBasicAuthenticator.Password := FOpenAIApiKey;
  FRestClient.BaseURL := 'https://api.openai.com/v1/audio/transcriptions';
  FRestRequest.AddParameter('response_format', 'text',
    TRESTRequestParameterKind.pkREQUESTBODY);
  FRestRequest.AddParameter('model', 'whisper-1',
    TRESTRequestParameterKind.pkREQUESTBODY);
  FRestRequest.AddFile('file', InputAudioFilePath,
    TRESTContentType.ctAPPLICATION_OCTET_STREAM);
  FRestRequest.Client := FRestClient;
  FRestRequest.Execute;
  Result := FRestRequest.Response.Content;
end;

function TChatGPT.GetImageAsStream(const ImageURL: string): TMemoryStream;
begin
  Result := TMemoryStream.Create;
  FNetHttpClient.Get(ImageURL, Result);
end;

function TChatGPT.GetImageURLFromJSON(const JsonResponse: string): string;
var
  Json: TJsonObject;
  DataArr: TJsonArray;
begin
  Json := TJsonObject.ParseJSONValue(JsonResponse) as TJsonObject;
  try
    if Assigned(Json) then
    begin
      DataArr := TJsonArray(Json.Get('data').JsonValue);
      Result := TJSONPair(TJSONObject(DataArr.Items[0]).Get('url')).JsonValue.Value;
    end
    else
      Result := '';
  finally
    Json.Free;
  end;
end;

function TChatGPT.GetImageBASE64FromJSON(const JsonResponse: string): string;
var
  Json: TJsonObject;
  DataArr: TJsonArray;
begin
  Json := TJsonObject.ParseJSONValue(JsonResponse) as TJsonObject;
  try
    if Assigned(Json) then
    begin
      DataArr := TJsonArray(Json.Get('data').JsonValue);
      Result := TJSONPair(TJSONObject(DataArr.Items[0]).Get('b64_json')).JsonValue.Value;
    end
    else
      Result := '';
  finally
    Json.Free;
  end;
end;

function TChatGPT.GetJSONWithImage(const Prompt: string; ResponseFormat: Integer): string;
var
  JObj: TJsonObject;
  Request: string;
  ResponseContent, StringStream: TStringStream;
  MyHeaders: TArray<TNameValuePair>;
begin
  JObj := nil;
  ResponseContent := nil;
  StringStream := nil;
  try
    SetLength(MyHeaders, 2);
    MyHeaders[0] := TNameValuePair.Create('Authorization', FOpenAIApiKey);
    MyHeaders[1] := TNameValuePair.Create('Content-Type', 'application/json');
    JObj := TJSONObject.Create;
    with JObj do
    begin
      Owned := False;
      AddPair('model', 'dall-e-2');
      if ResponseFormat = 1 then
        AddPair('response_format', 'b64_json')
      else
        AddPair('response_format', 'url');
      AddPair('prompt', Prompt);
      AddPair('n', TJSONNumber.Create(1));
      AddPair('size', '1024x1024');
    end;
    Request := JObj.ToString;
    StringStream := TStringStream.Create(Request, TEncoding.UTF8);
    ResponseContent := TStringStream.Create;
    FNetHttpClient.Post('https://api.openai.com/v1/images/generations',
      StringStream, ResponseContent, MyHeaders);
    Result := ResponseContent.DataString;
  finally
    JObj.Free;
    ResponseContent.Free;
    StringStream.Free;
  end;
end;

class function TChatGPT.MessageContentFromChatGPT(const JsonAnswer: string): string;
var
  Mes: TJsonArray;
  JsonResp: TJsonObject;
begin
  JsonResp := nil;
  try
    JsonResp := TJsonObject.ParseJSONValue(JsonAnswer) as TJsonObject;
    if Assigned(JsonResp) then
    begin
      Mes := TJsonArray(JsonResp.Get('choices').JsonValue);
      Result := TJsonObject(TJsonObject(Mes.Get(0)).Get('message').JsonValue)
        .GetValue('content').Value;
    end
    else
      Result := '';
  finally
    JsonResp.Free;
  end;
end;

function TChatGPT.SendTextToChatGPT(const Text: string): string;
var
  JArr: TJsonArray;
  JObj, JObjOut: TJsonObject;
  Request: string;
  ResponseContent, StringStream: TStringStream;
  Headers: TArray<TNameValuePair>;
  I: Integer;
begin
  JArr := nil;
  JObj := nil;
  JObjOut := nil;
  ResponseContent := nil;
  StringStream := nil;
  try
    SetLength(Headers, 2);
    Headers[0] := TNameValuePair.Create('Authorization', FOpenAIApiKey);
    Headers[1] := TNameValuePair.Create('Content-Type', 'application/json');
    JObj := TJsonObject.Create;
    JObj.Owned := False;
    JObj.AddPair('role', 'user');
    JArr := TJsonArray.Create;
    JArr.AddElement(JObj);
    Self.FText := Text;
    JObj.AddPair('content', FText);
    JObjOut := TJsonObject.Create;
    JObjOut.AddPair('model', 'gpt-3.5-turbo');
    // The messages array is added as a string here and patched up below
    JObjOut.AddPair('messages', Trim(JArr.ToString));
    JObjOut.AddPair('temperature', TJSONNumber.Create(0.7));
    // Strip the escape backslashes so the embedded messages array becomes raw JSON
    Request := JObjOut.ToString.Replace('\', '');
    // Delphi strings are 1-based: blank out the quotes wrapping the messages array
    for I := 1 to Length(Request) - 1 do
    begin
      if ((Request[I] = '"') and (Request[I + 1] = '[')) or
         ((I > 1) and (Request[I] = '"') and (Request[I - 1] = ']')) then
        Request[I] := ' ';
    end;
    ResponseContent := TStringStream.Create;
    StringStream := TStringStream.Create(Request, TEncoding.UTF8);
    FNetHttpClient.Post('https://api.openai.com/v1/chat/completions',
      StringStream, ResponseContent, Headers);
    Result := FormatJSON(ResponseContent.DataString);
  finally
    StringStream.Free;
    ResponseContent.Free;
    JObjOut.Free;
    JArr.Free;
    JObj.Free;
  end;
end;

end.
```
Implementing speech generation from a textual description, and text extraction from speech in a media file, in our Embarcadero Delphi FMX application
In our Delphi FMX application, we will use the TNetHttpClient component to work with the OpenAI API, specifically for sending POST requests to OpenAI.
To play the speech generated by OpenAI and saved in a media file (in MP3 format) in our Embarcadero Delphi FMX application, we will use the TMediaPlayer component.
To make a request to OpenAI with the transfer of a saved media file for extracting text from speech within it, we will use three components: TRESTClient, TRESTRequest, and THTTPBasicAuthenticator.
No additional setup is required for these components. TRESTClient and TRESTRequest are used to make POST requests and retrieve data from OpenAI with the extracted speech text from our media file. THTTPBasicAuthenticator is used for authentication using the secret key.
To input textual descriptions for speech generation, we will use the TMemo component.
We will also use the TMemo component to display the extracted text from the speech in the media file.
In the OnCreate event handler of the main form, we need to assign the path to the media file where the speech generated by OpenAI will be saved to the FAudioFilePath field. We will also assign the value of the secret key to the FOpenAIApiKey field.
The functionality of speech generation from a textual description, with saving to a media file and playback in our Embarcadero Delphi FMX application, is implemented in the OnClick handler of the “Send Request For Speech Generation” button. In this handler, we declare a GPTHelper object (of the IChatGPTHelper type) to pass the textual description to OpenAI for speech generation, and an ImageStream object (of the TMemoryStream class) to store the generated speech as a TMemoryStream.
Next, we call the constructor of the TChatGPT class, passing NetHttpClient1 and our secret key prefixed for the Authorization header ('Bearer ' + FOpenAIApiKey). Then we invoke the GetGeneratedSpeechAsStream method, providing the textual description of the generated speech (Memo2.Text) and the voice type (the string ‘onyx’ in our example). To avoid blocking the application interface while the request executes, we wrap the call in TTask.Run. The result of GetGeneratedSpeechAsStream, namely the generated speech, is saved into ImageStream.
In the main application thread, using TThread.Synchronize, we save the speech to an MP3 media file with the ImageStream.SaveToFile method. Before saving, we check whether a file already exists at the specified path using the FileExists function and, if so, delete it with the DeleteFile function. After saving, we play the media file in our Embarcadero Delphi FMX application using TMediaPlayer (MediaPlayer1.Play), first assigning the path to our media file to MediaPlayer1.FileName.
The code for the “Send Request For Speech Generation” button handler is provided below.
```delphi
procedure TForm1.Button1Click(Sender: TObject);
var
  GPTHelper: IChatGPTHelper;
  ImageStream: TMemoryStream;
begin
  TTask.Run(
    procedure
    begin
      GPTHelper := TChatGPT.Create(NetHTTPClient1, 'Bearer ' + FOpenAIApiKey);
      ImageStream := GPTHelper.GetGeneratedSpeechAsStream(Memo2.Text, 'onyx');
      try
        TThread.Synchronize(nil,
          procedure
          begin
            // Overwrite any previously generated audio file
            if FileExists(FAudioFilePath) then
              DeleteFile(FAudioFilePath);
            ImageStream.SaveToFile(FAudioFilePath);
            MediaPlayer1.FileName := FAudioFilePath;
            MediaPlayer1.Play;
            ShowMessage('All is done!!!');
          end);
      finally
        ImageStream.Free;
      end;
    end);
end;
```
Now let’s extract text from the speech in our saved media file. We will implement this functionality in the OnClick handler of the “Speech From Audio File To Text” button. In the handler, we declare a GPTHelper object (of the IChatGPTHelper type) to pass our media file to OpenAI for text extraction. We also declare a string variable Text, where we will store the text extracted from the media file.
Next, we should call the second variant of the constructor with four input parameters (HTTPBasicAuthenticator1, RESTClient1, RESTRequest1, FOpenAIApiKey). Then, we will invoke the GetGeneratedTextFromSpeech method, passing the path to our media file. This method will return the extracted text from the speech in the media file. Finally, we will display the received text using TMemo (Memo1.Text).
The code for the “Speech From Audio File To Text” button handler is provided below.
```delphi
procedure TForm1.Button4Click(Sender: TObject);
var
  GPTHelper: IChatGPTHelper;
  Text: string;
begin
  TTask.Run(
    procedure
    begin
      GPTHelper := TChatGPT.Create(HTTPBasicAuthenticator1, RESTClient1,
        RESTRequest1, FOpenAIApiKey);
      Text := GPTHelper.GetGeneratedTextFromSpeech(FAudioFilePath);
      TThread.Synchronize(nil,
        procedure
        begin
          Memo1.Text := Text;
          ShowMessage('All is done!!!');
        end);
    end);
end;
```
You also need to add the following code to the FormCreate event handler – replace the placeholder string with your own OpenAI API key:
```delphi
procedure TForm1.FormCreate(Sender: TObject);
begin
  // TPath requires System.IOUtils in the uses clause
  FAudioFilePath := TPath.Combine(TPath.GetDocumentsPath, 'GeneratedVoice.mp3');
  FOpenAIApiKey := '**** YOUR OPEN API KEY GOES HERE ****';
end;
```
Let’s test our Embarcadero Delphi FMX application. First, based on a textual description, we will generate speech. The speech will be played using TMediaPlayer and saved to a media file with an mp3 extension.
In our Embarcadero Delphi FMX application, the media file will be saved in the “Documents” directory.
Now, using our application, let’s convert our speech saved in the media file back into text.
Where can I download the example code?
The code is in this repository: https://github.com/Embarcadero/OpenAI_Audio_Demo
Do you want to try some of these examples for yourself? Why not download a free trial of the latest version of RAD Studio with Delphi?
This article was written by Embarcadero Tech Partner Softacom. Softacom specialize in all sorts of software development focused on Delphi. Read more about their services on the Softacom website.
I’ve got a compile error as below:
[dcc64 Error] ChatGPTHelper.pas(121): E2003 Undeclared identifier: ‘LoadFromStream’
I am using Embarcadero® RAD Studio 12 Version 29.0.51961.7529
Any advice, please?
Hi there – I will ask the blog post author – Softacom – to reply with any advice they have.
OK the post has been updated to correct the problem. Softacom will also email you the code directly.
Thank you Ian and Softacom!
FYI, the updated part (the full source code of the TChatGPT class) is displayed on one line, so it is not easy to look through.
Hi BY – thanks for pointing that out. I’ve updated the post to use our regular syntax highlighted block type so the code will be a lot less problematic now.
Hi!
Where is the update? I copied the code and get the same error. I actually have the same problem with the post about Firebase (https://blogs.embarcadero.com/how-to-use-the-firebase-api-to-add-read-and-delete-data-in-a-realtime-document-oriented-database/) but I can’t find the solution there either.
Thanks in advance
Sorry Jordi, I have fixed it now after Softacom sent me a new version. I’ve updated the article and also published the source code (tested with RAD Studio 12) here: https://github.com/Embarcadero/OpenAI_Audio_Demo
Hi Ian,
Where is the solution? I copied all the code and have the same error, and in the post about Firebase I have the same error.
Any solution, please?
Thanks in advance
Yes, sorry about that, a few gremlins sneaked in. I’ve updated the article and also published the source code (tested with RAD Studio 12) here: https://github.com/Embarcadero/OpenAI_Audio_Demo