Don’t spend a fortune on generic voice AI providers. Build it yourself.
What are the basic components?
At the heart of a voice AI stack is an LLM that engages in the conversation by generating responses. Whether your LLM sits behind a custom RAG pipeline or some other fancy architecture is up to you.
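Roughly, the conversation loop looks like this. A minimal sketch using the OpenAI SDK as a stand-in; the model name and system prompt are placeholders, so swap in whatever provider or RAG pipeline you actually use:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [{"role": "system",
            "content": "You are a friendly voice assistant. Keep answers to one or two sentences."}]

def respond(user_text: str) -> str:
    """Append the user's utterance, get the assistant's reply, and keep the history."""
    history.append({"role": "user", "content": user_text})
    completion = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(respond("Hi, can you hear me?"))
```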
LLMs operate on text, so you need a second model that transcribes speech into text.
There’s a tradeoff between quality and speed, but if you want to build a real-time voice application these days, I don’t see anyone using anything other than Deepgram.
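To make that concrete, here’s a rough sketch of streaming 16 kHz PCM to Deepgram’s realtime websocket. The endpoint, query parameters, and response fields are from memory, so double-check them against Deepgram’s docs:

```python
import asyncio, json, os
import websockets

# Rough sketch: stream raw 16 kHz, 16-bit PCM to Deepgram's realtime endpoint and
# print transcripts as they arrive. Verify params and response schema in Deepgram's docs.
DG_URL = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000"

async def transcribe(audio_chunks):
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: older versions of the websockets library call this kwarg extra_headers.
    async with websockets.connect(DG_URL, additional_headers=headers) as ws:

        async def send_audio():
            async for chunk in audio_chunks:  # bytes from your mic / WebRTC track
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def read_transcripts():
            async for message in ws:
                result = json.loads(message)
                alt = result.get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    tag = "final" if result.get("is_final") else "interim"
                    print(tag, alt["transcript"])

        await asyncio.gather(send_audio(), read_transcripts())
```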
What’s next?
You need to take your generated text and synthesize a voice, unless you want to read the responses aloud to your clients yourself. This is NOT what Paul Graham means when he says, “Do things that don’t scale.”
There are amazing voice APIs, e.g., ElevenLabs’ high-quality voices and the recently launched Cartesia with BLAZINGLY fast voice generation.
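A sketch of the ElevenLabs side using their plain text-to-speech REST endpoint; the voice_id and model_id below are placeholders, and in production you’d reach for their streaming options to cut latency:

```python
import os
import requests

def synthesize(text: str, voice_id: str = "YOUR_VOICE_ID") -> bytes:
    """Synthesize a reply with ElevenLabs' text-to-speech endpoint and return audio bytes."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_turbo_v2"},  # placeholder model_id
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # MP3 bytes by default; stream these back to the caller

with open("reply.mp3", "wb") as f:
    f.write(synthesize("Thanks for calling, how can I help today?"))
```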
But how do you move audio between the client and AI/server?
WebRTC is the standard for sending real-time data online. Daily has an amazing WebRTC platform with client libraries that abstract away audio playback for you (try doing that in Swift).
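Daily’s SDKs hide all of this plumbing. If you’re curious what the raw server side can look like, here’s a bare-bones sketch with the open-source aiortc library (not Daily’s API); signaling, i.e., exchanging the SDP offer/answer, is application-specific and left out:

```python
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaPlayer, MediaRecorder

async def handle_offer(offer_sdp: str) -> str:
    """Accept a caller's WebRTC offer, capture their audio, and send audio back."""
    pc = RTCPeerConnection()
    recorder = MediaRecorder("caller_audio.wav")  # dump incoming audio; feed it to STT in practice

    @pc.on("track")
    def on_track(track):
        if track.kind == "audio":
            recorder.addTrack(track)

    pc.addTrack(MediaPlayer("reply.mp3").audio)   # outgoing audio: your TTS output

    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    await recorder.start()
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc.localDescription.sdp                # send this back over your signaling channel
```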
So, what’s missing?
We covered HOW to generate a response, but you must also decide WHEN to respond.
Use a voice activity detection model like Silero that tells you whether your client is speaking. During silence, you can query a small, fast LLM to evaluate semantically whether the client has really finished speaking and your AI should respond.
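A rough sketch of that two-stage check, assuming Silero VAD loaded via torch.hub and a small hosted LLM for the semantic end-of-turn call; the model name and prompt are illustrative placeholders:

```python
import torch
from openai import OpenAI

# Stage 1: Silero VAD gives a speech probability per short audio chunk.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
llm = OpenAI()

def is_speech(chunk: torch.Tensor, sample_rate: int = 16000) -> bool:
    # chunk: a short window of float32 samples (recent Silero versions expect 512 samples at 16 kHz)
    return vad_model(chunk, sample_rate).item() > 0.5

# Stage 2: during a pause, ask a small LLM whether the transcript reads like a finished thought.
def turn_finished(transcript: str) -> bool:
    out = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any small, fast model works
        messages=[{
            "role": "user",
            "content": f'The caller paused after saying: "{transcript}". '
                       "Answer YES if they are done speaking, NO if they are mid-thought.",
        }],
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")
```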
Turn-taking is very domain-specific.
At Sonia (YC W24), we’re building an AI therapist, and pace and turn-taking fundamentally differ from customer support or sales agents.
To make turn-taking work really well, you want to incorporate the full audio signal, as the spoken word contains important information beyond just the text. Depending on your domain, a custom-built turn-taking model might make sense.
Care about latency.
Perplexity CEO Aravind Srinivas cares A LOT about latency. Improve latency by hosting models yourself. Baseten has a simple interface for model hosting with a large library of open-source models.
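Whatever you end up hosting, measure it. A simple sketch for tracking time-to-first-token on a streaming LLM call, which is a big chunk of how long your agent takes to start speaking; it works against any OpenAI-compatible endpoint, and the model name is a placeholder:

```python
import time
from openai import OpenAI

client = OpenAI()  # point base_url at your own deployment if you self-host

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
            print(f"time to first token: {first_token_at - start:.2f}s")
print(f"total: {time.perf_counter() - start:.2f}s")
```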
But will GPT-4o voice mode come and haunt us?
Embrace the progress in the field.
It will be hard to beat multimodal model latency as long as you use a similarly complex LLM in your custom stack. But I don’t think that gap matters much, since you already get 1-2 second latency building it yourself.
However, depending on the application, there will be huge benefits to feeding raw (or preprocessed) audio into the model without having text as the bottleneck modality.
Many apps already go beyond voice with video avatars like Tavus. You can expect increased engagement for many applications (e.g. healthcare, education).
The space is quite hot right now. Our YC friends at Arini (YC W24) handle phone calls for dentists, and Scritch (YC W24) recently launched AI assistants for vets.
My calendar is open if you want to connect or get advice!