Using OpenAI's RealTime API _ WorkAdventure Documentation
In this article, I'm going to describe our experience creating a WorkAdventure bot using the new OpenAI Realtime API. This API is revolutionary because it allows you to interact with an AI model in speech-to-speech mode.
Before diving into the details, let's take a look at the final result:
INFO
This article is targeted at developers who want to start using the new OpenAI Realtime API in
their projects. The API is still in beta as I'm writing this article, so things might have changed by
the time you read it.
Previously, to interact with an AI model, you had to turn your voice into text, send it to the model, and
then turn the model's response back into speech. This process was somewhat slow. It could take a few
seconds for the AI to respond, leading to an awkward conversation.
We have already experimented with the previous OpenAI API in WorkAdventure. The results were good,
but the conversation was not as smooth as we would have liked.
With the new OpenAI Realtime API, the model directly takes your voice as input and responds in
real-time. This makes the conversation smoother. Furthermore, because there is no need to convert
speech to text, the model does not lose important information, like the tone of your voice, that
would otherwise be lost during transcription. It can also respond with an appropriate tone.
The API is still in beta, but the demos were very impressive. So we decided to give it a try.
Interacting with the API is very different from the previous chat completion API. Because we are dealing
with audio, the client and the API continuously exchange messages over a WebSocket. The client
sends audio chunks to the model and receives audio chunks in response. Because we are using a
WebSocket, the API is now stateful. This means you no longer need to resend the context of the
conversation at each turn. The model keeps track of the conversation context and can respond
accordingly.
The context of WorkAdventure is somewhat special. Bots are actually JavaScript scripts that run on
the browser side: each bot runs in a headless browser tab on a server and uses the Scripting API
to interact with the WorkAdventure map. Because the bots run in a browser that itself lives on the
server side, we can actually put the OpenAI key right into the browser. This saves us from having
to manage the key on the server side and run a separate proxy server in front of the API.
If you are looking to implement the real-time API in your own project, it is likely your setup will be
different. You might need a relay server that will live on the server, take calls from the client, and
forward them to the OpenAI API, adding the API key in the process. I would also like to mention that
Livekit seems to have a great higher-level library for handling bots:
https://docs.livekit.io/agents/openai/overview/ I haven't had the opportunity to test it yet, but it looks
promising and you probably should have a look at it before starting.
Getting started
Instead of directly talking on the WebSocket with the OpenAI Realtime API, we decided to use a
wrapper library provided by OpenAI: https://github.com/openai/openai-realtime-api-beta.
Now that the library is installed, let's create our "Robot" class that will handle the communication with
the API:
import { RealtimeClient } from '@openai/realtime-api-beta';

class Robot {
    private realtimeClient: RealtimeClient;

    constructor(
        apiKey: string,
        private audioTranscriptionModel = "whisper-1",
        private voice = "alloy",
    ) {
        // The key lives in the (server-side, headless) browser, as explained above.
        this.realtimeClient = new RealtimeClient({
            apiKey,
            dangerouslyAllowAPIKeyInBrowser: true,
        });

        // We update the session with the voice and the audio transcription
        // model we want to use.
        this.realtimeClient.updateSession({voice: this.voice});
        this.realtimeClient.updateSession({
            // VAD means "Voice Activity Detection".
            // In "server_vad" mode, we let OpenAI decide when to start and
            // stop speaking.
            turn_detection: {type: 'server_vad'},
            input_audio_transcription: {model: this.audioTranscriptionModel},
        });
    }
}
At this point, we can see in the console that the model is responding with audio chunks.
If we take a look at the "delta" part of the event, we can see some events contain "transcripts", which
are usually single words, and other events contain "audio" chunks.
Ok, so we have the audio chunks and we need to turn them into a MediaStream . A media stream is a
stream of audio data that can be played by the browser. We can create a MediaStream from an array of
audio chunks. The MediaStream can then be played by the browser or sent to a WebRTC connection.
The audio chunks sent by the Realtime API are in a PCM 16 format (that is: raw 16 bit PCM audio at
24kHz, 1 channel, little-endian).
In the browser however, the API in charge of managing audio is the Web Audio API. It is operating on
32bit float numbers. So we need to convert the 16bit integer audio chunks into 32bit float audio chunks.
First of all, I was very worried about this 'little-endian' thing. Little endian means that the least significant
byte is stored first. Handling that in JavaScript is not straightforward. However, when dealing with audio
output from OpenAI, the audio is correctly formatted, so you don't have to worry about endianness.
So turning from 16bit PCM to 32bit float is actually turning an integer value between -32768 and 32767
into a float value between -1 and 1, which is quite simple.
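As a sketch, the conversion can look like this (the function name is mine, not from the original code; dividing by 32768 maps the int16 range onto [-1, 1)):

```typescript
// The Realtime API sends raw 16-bit little-endian PCM; the Web Audio API wants
// 32-bit floats in [-1, 1]. Dividing by 32768 maps the int16 range onto floats.
function pcm16ToFloat32(buffer: ArrayBuffer): Float32Array {
  const view = new DataView(buffer);
  const out = new Float32Array(buffer.byteLength / 2);
  for (let i = 0; i < out.length; i++) {
    // 'true' asks DataView for little-endian, regardless of the host platform.
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```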
Now that we have the audio in the correct format, we need to create a MediaStream from it.
There are many ways to do that. The simplest one involves using a ScriptProcessorNode . This is
easy, but unfortunately, it is deprecated. You can also use an AudioBufferSourceNode , but the
documentation says it does not play well with streaming audio and is better suited to playing short
audio clips.
The technique I will use relies on the AudioWorkletNode . This is the modern way to do things, but it is a
bit more complex.
Audio worklets run in a separate thread and can be used to process audio in real-time. Even though
we don't perform heavy computations that could block the main thread, audio worklets force us to
process the audio in that separate thread anyway.
The audio worklet can receive audio chunks if a MediaStream is plugged in input and can send some
audio output to a destination MediaStream. It can also communicate with the main thread using the
postMessage API.
In this case, we will use the audio worklet to play the audio chunks we receive from the Realtime API.
We will send the audio chunks to the audio worklet using the postMessage API and the audio worklet
will generate a MediaStream.
The code will be split into 2 files: one contains the audio worklet processor, and the other is the
main file that creates the audio worklet node and connects it to the audio output.
Let's start with the audio worklet processor (the one running in a dedicated thread).
Data will be arriving via the postMessage API (in the onmessage method), as an object with this
structure:
{
pcmData: Float32Array
}
The pcmData field contains the audio data in the correct format (32bit float). We will store this data in
the audioQueue .
constructor() {
super();
this.port.onmessage = (event: MessageEvent) => {
const data = event.data.pcmData;
if (data instanceof Float32Array) {
this.audioQueue.push(data);
} else {
console.error("Invalid data type received in worklet", event.data);
}
};
}
process(inputs: Float32Array[][],
outputs: Float32Array[][],
parameters: Record<string, Float32Array>): boolean {
const output = outputs[0];
const outputData = output[0];
When the processor is running, the process method is called each time the audio worklet needs to
process audio data (so every few milliseconds). The process method has "inputs" and "outputs"
parameters. The "inputs" parameter will be completely ignored in our case. The "outputs" parameter is
an array of arrays of Float32Arrays. The first array contains the output data. In our case, we will only use
the first output array.
If there is data in the audioQueue , the processor will copy the data to the output buffer, trying to fill it as
much as possible. It is important to note that when the process function is called, the output buffer is
already filled with zeros (silence). So if there is no data in the audioQueue , the output buffer will be
filled with silence. This is exactly what we want.
Last but not least, please note that we call registerProcessor at the end of the file. The
output-pcm-worklet-processor string is the name we will use to create the audio worklet node in the main file.
Typescript support
If you are using Typescript, it is likely you will be missing the AudioWorkletProcessor type. There is an
NPM package that adds it ( @types/audioworklet ), but it conflicts with the DOM types. So instead, I
copied the types from a GitHub issue found here into my project.
Now that we have the audio worklet processor, we need to create the audio worklet node in the main
file.
From a bird's eye view, without any error management, the process looks like this:
await this.audioContext.resume();
// Let's load the audio worklet processor (assuming the file is served at this URL)
await this.audioContext.audioWorklet.addModule("output-pcm-processor.js");
// Instantiate the audio worklet node (we use the 'output-pcm-worklet-processor' string
// registered in the processor file)
const workletNode = new AudioWorkletNode(this.audioContext, 'output-pcm-worklet-processor');
In practice, it is a bit more complex. In a real-world application, you will use a bundler. So we cannot
reference the "pcm-processor.js" file directly without letting the bundler know about it.
In Vite, we can use the "?worker&url" parameter in imports to reference a file directly. This is very useful
for our use case.
// ...
await this.audioContext.audioWorklet.addModule(audioWorkletProcessorUrl);
When this is done, we can send the audio chunks to the audio worklet processor:
workletNode.port.postMessage(
{ pcmData: float32Array },
{ transfer: [float32Array.buffer] }
);
The transfer option is used to transfer the ownership of the buffer to the audio worklet processor.
This is important because the buffer is a large object and we don't want to copy it (that would waste
CPU cycles). We want to transfer it.
constructor(sampleRate = 24000) {
    this.audioContext = new AudioContext({ sampleRate });
    this.mediaStreamDestination =
        this.audioContext.createMediaStreamDestination();
    this.isWorkletLoaded = false;
}
From there, you can use the PCMStreamer class to stream audio data to a MediaStream that can be
used in a WebRTC connection. Whenever the Realtime API sends an audio chunk, you convert it to
a Float32Array and send it to the PCMStreamer instance via the appendAudioData method:
audioStream.appendAudioData(float32Array);
Success!
But this is only the beginning. So far, we are getting audio chunks from the Realtime API and playing
them in the browser. Now, we need to send our audio data to the Realtime API!
Fortunately, the openai-realtime-api-beta library automatically converts the 32-bit float audio data
to 16-bit PCM audio data in little endian on our behalf (the little endian part is important here).
We will just have to make sure we are sending the samples at the requested 24kHz sample rate.
Turning a MediaStream into audio chunks turns out to be quite similar to what we did previously. We will
use the Web Audio API and design a new worklet. This time, the worklet will receive a MediaStream -- in the
case of WorkAdventure, where multiple people can talk to the AI at the same time, it can receive many
MediaStreams. The MediaStreams will be mixed together and sent back to the main thread using the
postMessage API.
// Let's merge all the inputs in one big Float32Array by summing them
const mergedInput = new Float32Array(inputs[0][0].length);
A quick note about the parameters passed to the process function. The inputs parameter is an array
of arrays of Float32Arrays. Why do we have an array of arrays of arrays?
The innermost array is the audio data for a single channel: each value represents the audio level at a
given point in time. But microphones can be stereo. In that case, we don't have one channel but two. So we
wrap the Float32Arrays into an array that contains one channel if the microphone is mono, or two
channels if it is stereo.
But we can also have many input sources (many microphones, or many streams coming from
different WebRTC sources...). The outermost array is used to represent those different input sources.
On each call to the process function, we will sum all the input sources together to create a single
audio stream, and send this stream to the main thread via the postMessage API.
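The summing step above can be sketched as a pure function (the name mixInputs is mine; inside the worklet it would be called from process with the inputs parameter):

```typescript
// Sum the first channel of every input source into a single mono buffer.
// inputs[source][channel] is a Float32Array, as in AudioWorkletProcessor.process.
function mixInputs(inputs: Float32Array[][]): Float32Array {
  // All render quanta share the same length (128 samples at the time of writing).
  const frameLength = inputs[0]?.[0]?.length ?? 128;
  const merged = new Float32Array(frameLength);
  for (const channels of inputs) {
    const channel = channels[0]; // mono mix: take the first channel of each source
    if (!channel) continue;
    for (let i = 0; i < frameLength; i++) {
      merged[i] += channel[i];
    }
  }
  return merged;
}
```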
The code to instantiate the audio Worklet in the main file will look like this:
await this.audioContext.resume();
// Let's load the audio worklet processor (assuming the file is served at this URL)
await this.audioContext.audioWorklet.addModule(audioWorkletProcessorUrl);
// Instantiate the audio worklet node (we use the 'input-pcm-worklet-processor' string
// registered in the processor file)
const workletNode = new AudioWorkletNode(this.audioContext, 'input-pcm-worklet-processor');
// Connect the media stream as an input to our worklet node (this assumes you already
// have a MediaStream connected to your microphone)
const sourceNode = this.audioContext.createMediaStreamSource(mediaStream);
sourceNode.connect(workletNode);
Sending the audio chunks to the Realtime API is quite simple. We just need to call the
appendInputAudio method of the RealtimeClient. The data we pass must be a 16bit PCM audio buffer.
We can use the RealtimeUtils.floatTo16BitPCM method provided by the Realtime API to convert
the 32bit float audio data to 16bit PCM.
this.realtimeClient.appendInputAudio(RealtimeUtils.floatTo16BitPCM(audioBuffer));
However, just doing this will lead to a conversation that will fail after a few seconds. In our experience,
this is because we are calling the Realtime API too often.
Out of the box, when I tested the audio worklet, it generated Float32Arrays with 128 samples (i.e. 128
values in the array). Because we run at 24kHz, this means we are sending 24000 / 128 = 187.5 chunks
per second. This is probably a bit too much for the Realtime API.
We don't need to send audio chunks at a very high rate. One audio chunk every ~50ms should be more
than enough. That means we could target a chunk size of 24000 * 0.05 = 1200 samples.
We can simply buffer the audio chunks in the main thread and send them to the Realtime API when we
have enough data.
Managing interruptions
The Realtime API sends audio chunks faster than we can play them. This means that at some point in time,
we might have several seconds of audio chunks in our OutputAudioWorklet .
If we interrupt the conversation, this buffer is already on our computer and will still be played. This is not
what we want.
Fortunately, in VAD mode, the Realtime API sends a conversation.interrupted event whenever it
detects the user has interrupted the conversation. We can use this event to clear the buffer. The buffer
is inside the OutputAudioWorklet so we need to send a message to the worklet to clear the buffer.
Let's say that if we receive the following event, we clear the buffer:
{
emptyBuffer: true
}
constructor() {
super();
this.port.onmessage = (event: MessageEvent) => {
// In case the event is a buffer clear event, we empty the buffer
if (event.data.emptyBuffer === true) {
this.audioQueue = [];
} else {
const data = event.data.pcmData;
if (data instanceof Float32Array) {
this.audioQueue.push(data);
} else {
                console.error("Invalid data type received in worklet", event.data);
            }
}
};
}
//...
}
On the main thread side, we can add an additional method to the OutputPCMStreamer class to clear
the buffer:
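A minimal sketch of such a method, assuming the class keeps a reference to the worklet node's port (the MessagePortLike interface is mine, added so the sketch stays independent of browser types):

```typescript
// Minimal port interface so the sketch stays independent of browser types;
// in the browser this would simply be workletNode.port.
interface MessagePortLike {
  postMessage(message: unknown): void;
}

class OutputPCMStreamer {
  constructor(private workletPort: MessagePortLike) {}

  // Ask the worklet to drop any queued audio; the worklet's onmessage
  // handler reacts to { emptyBuffer: true } by clearing its audioQueue.
  resetAudioBuffer(): void {
    this.workletPort.postMessage({ emptyBuffer: true });
  }
}
```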
When we receive the conversation.interrupted event, we can call the resetAudioBuffer method
of the OutputPCMStreamer instance.
this.realtimeClient.on('conversation.interrupted', () => {
outputPCMStreamer.resetAudioBuffer();
});
What's remaining?
The system is working well, but there is still room for improvement.
Remember that OpenAI sends audio chunks faster than we can play them? Now imagine the
conversation is interrupted. We are going to remove a lot of audio chunks from the buffer, but we did not
tell the Realtime API that those audio chunks have not been played. So the Realtime API will "think" we
heard everything it said, but we didn't.
Using WorkAdventure?
If you are using WorkAdventure (you should!) and are looking to implement a bot using the Realtime
API, we added a few methods to the scripting API to help you with that.
This will make things considerably easier for you as we take care of the audio worklet instantiation and
the audio stream management.
Conclusion
This is it! We have a fully working system that allows us to interact with bots via voice! The bots'
reactivity is insane. The conversation is smooth and the tone of the bot is appropriate.
It took us about 3 days of work to get this first version working and we are very happy with the result. We
are now looking forward to integrating this system deeper into WorkAdventure, by adding more features
via the function calling feature of the Realtime API!
Next steps will be to have the bot better handle interruptions, and then react to the environment, guide
you through your office, etc...