AICaller - AI Voice Interviewer | TheDesiLoops

Overview

The goal of AI Caller was to build an interface through which an LLM can talk with a user by voice and hold a conversation in real time. This speeds up data gathering and makes it possible to generate new insights from historical conversational data.

Technologies Used

Voice Activity Detection - I used the pyannote/segmentation model from Hugging Face.
Transcription - I used the Systran/faster-whisper-large-v3 model from Hugging Face.
LLM - I chose GPT-4o-mini under the Autogen framework. The main reason for going with Autogen was that I wanted multiple experts that can jump in and out of the conversation.
Text to Speech - To convert the response to speech, I chose ElevenAI.
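
As a rough sketch of how these four stages could fit together for a single conversational turn: the class, the method names, and the stub stages below are hypothetical placeholders for illustration, not the project's actual code. Each stage is a plain callable, so the real models could be swapped in without changing the turn logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the four stages; all names are illustrative."""
    detect_speech: Callable[[bytes], bool]  # VAD: does this audio contain speech?
    transcribe: Callable[[bytes], str]      # speech-to-text
    respond: Callable[[str], str]           # LLM / agent reply
    synthesize: Callable[[str], bytes]      # text-to-speech

    def handle_turn(self, audio: bytes) -> Optional[bytes]:
        """Run one user turn; return reply audio, or None if nothing was said."""
        if not self.detect_speech(audio):
            return None
        text = self.transcribe(audio).strip()
        if not text:
            return None
        reply = self.respond(text)
        return self.synthesize(reply)


# Stub stages so the sketch runs without any models installed.
pipeline = VoicePipeline(
    detect_speech=lambda audio: len(audio) > 0,
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Skipping the LLM and text-to-speech calls when the VAD or the transcript comes back empty is one simple way to avoid paying for turns in which the user said nothing.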

Features

Autogen - Multi-agent workflows that orchestrate meaningful and highly productive conversations.
Noise Cancellation and Voice Activity Detection - Minimize background noise and smooth out voice input processing.
Low-Cost - Keeps the cost of operations 10x lower than other similar solutions.

Challenges and Solutions

The main challenge in building AI Caller was processing the voice input both accurately and with low latency. For example, while a user is speaking, it was very important not to break the conversational flow by responding during a brief pause, and instead to wait until the user had actually stopped talking.
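
One common way to handle this is to treat a turn as finished only after the silence has lasted longer than some threshold. The sketch below is a minimal illustration under assumptions: it takes per-frame speech/silence decisions from a VAD as input, and the function name, the frame length, and the 600 ms default are hypothetical values, not the project's actual settings.

```python
from typing import Iterable, Iterator, Optional, Tuple


def split_utterances(
    frames: Iterable[bool],
    frame_ms: int = 30,       # assumed duration of one VAD frame
    max_pause_ms: int = 600,  # assumed pause tolerated inside one utterance
) -> Iterator[Tuple[int, int]]:
    """Group per-frame VAD decisions into utterances.

    frames holds True where the VAD flagged speech and False for silence.
    Yields (start, end) frame indices, end exclusive. Silence that lasts
    longer than max_pause_ms ends the utterance; shorter pauses are kept
    inside it.
    """
    max_silent_frames = max_pause_ms // frame_ms
    start: Optional[int] = None
    last_speech = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_silent_frames:
            yield (start, last_speech + 1)
            start = None
    # Flush an utterance that is still open when the audio ends.
    if start is not None:
        yield (start, last_speech + 1)
```

The threshold is a trade-off: a shorter value makes replies feel faster but risks cutting the user off mid-thought, while a longer value is safer but adds latency to every turn.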

Impact

AI Caller gives us a very capable AI model that can handle user interactions over a wide variety of mediums, such as WhatsApp calls, browsers, phone calls, or mobile apps. On top of that, since all conversations are fully transcribed by design, older conversations can be used to derive new insights without needing to repeat them.
