An AI application that can hold a conversation in real time across a wide variety of topics.
The goal of building AI Voice Caller was to create an interface where an LLM can interact with a user by voice and hold a conversation in real time. This speeds up data gathering and makes it possible to generate new insights from historical conversational data.
Voice Activity Detection - For voice activity detection, I used the pyannote/segmentation model from Hugging Face.
Transcription - For transcription, I used the Systran/faster-whisper-large-v3 model from Hugging Face.
LLM - For the LLM, I chose GPT-4o-mini under the AutoGen framework. The main reason I went with AutoGen was that I wanted multiple experts that can jump in and out of the conversation.
Text to Speech - For converting the response to speech, I chose ElevenAI.
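The four stages above chain together into a single conversational turn. Below is a minimal sketch of that chaining, assuming each stage can be wrapped as a plain callable. The class name, field names, and the stand-in lambdas are all hypothetical and only illustrate the data flow; they are not the project's actual code and do not call any real model or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """One conversational turn: audio in, audio out.

    All four stages are hypothetical callables standing in for the
    real components (VAD model, transcription model, LLM, TTS service).
    """
    detect_speech: Callable[[bytes], bool]   # VAD: does this audio contain speech?
    transcribe: Callable[[bytes], str]       # speech to text
    generate_reply: Callable[[str], str]     # LLM / agent framework
    synthesize: Callable[[str], bytes]       # text to speech

    def handle_turn(self, audio: bytes) -> Optional[bytes]:
        # Skip the expensive stages when the VAD finds no speech.
        if not self.detect_speech(audio):
            return None
        text = self.transcribe(audio)
        reply = self.generate_reply(text)
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any model or API key.
pipeline = VoicePipeline(
    detect_speech=lambda audio: len(audio) > 0,
    transcribe=lambda audio: audio.decode("utf-8"),
    generate_reply=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Keeping each stage behind a simple callable means any one component could be swapped without touching the others, and the VAD check in front keeps silent audio from ever reaching transcription.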
The main challenge in building AI Caller was to process voice input both accurately and with low latency. For example, while a user is speaking, it was very important not to break the conversational flow: the system has to wait through brief pauses and respond only once the user has actually stopped talking.
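One common way to tell a brief pause from the end of a turn is to require a minimum run of silent frames after speech before responding. The sketch below illustrates that idea under stated assumptions: the VAD is assumed to emit one speech/non-speech flag per audio frame, and the function name and the `min_silence_frames` threshold are hypothetical, not taken from the project.

```python
from typing import Iterable, List


def end_of_turn_indices(speech_flags: Iterable[bool],
                        min_silence_frames: int = 3) -> List[int]:
    """Return the frame indices at which a user turn is judged finished.

    A turn ends once speech has been heard and is then followed by
    `min_silence_frames` consecutive non-speech frames. Shorter pauses
    are treated as part of the same turn.
    """
    ends: List[int] = []
    in_turn = False      # has the user started speaking?
    silence_run = 0      # consecutive silent frames inside the current turn
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            in_turn = True
            silence_run = 0          # a pause was interrupted, keep listening
        elif in_turn:
            silence_run += 1
            if silence_run >= min_silence_frames:
                ends.append(i)       # long enough silence: the turn is over
                in_turn = False
                silence_run = 0
    return ends


# A one-frame pause at index 3 is ignored; the turn ends after three silent frames.
flags = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
print(end_of_turn_indices(flags, min_silence_frames=3))  # [8, 13]
```

The threshold is the latency trade-off: a larger value interrupts the user less often but delays every reply by that many frames.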
AI Caller provides a capable AI model that can handle user interactions over a wide variety of mediums, such as WhatsApp calls, browsers, phone calls, or mobile apps. On top of that, since all conversations are fully transcribed by design, older conversations can be used to derive new insights without having to repeat them.