Build a Voice-Based Chatbot with OpenAI, Vocode, and ElevenLabs

by Natalia Kuzminykh , Associate Data Science Content Editor

Why might we want to make an LLM talk? The concept of having a human-like conversation with an advanced AI model is an interesting idea that has many practical applications.

Voice-based models are transforming how we interact with technology, making interactions more natural and intuitive. By enabling AI to talk, we open the door to numerous practical applications, from accessibility to enhanced human-machine interactions. This guide explores how to create a voice-based chatbot using OpenAI, Vocode and ElevenLabs.

  • Accessibility/Assistive Technology: Text-to-speech (TTS) systems can be a lifeline for people with visual impairments, reading difficulties, or those who have lost their ability to speak. These systems transform written information into spoken words, thereby making things like read-outs of web pages, books and documents more accessible.

    Tools such as Apple’s Personal Voice take this idea a step further by allowing individuals to create a synthesized voice that closely mimics their own. This can be essential for communicating with family and friends via FaceTime and phone calls, using assistive communication apps and even for in-person conversations.

  • Human-Machine Interaction: Imagine asking your computer for advice, assistance, or information in the same way you would ask a colleague or a friend. By enabling LLMs to talk, we allow more natural and intuitive interfaces for human-computer interactions.

    This is the idea behind popular virtual assistants like Siri or Alexa. These voice-enabled bots can be particularly valuable in situations where typing isn’t practical, such as when driving or when quick inputs are required while on the go. The immediate and spoken responses can make interactions faster and more efficient, thereby boosting productivity and convenience.

Of course, there are potential nefarious uses as well, which we explored in our recent webinar.

For this walkthrough, we’ll use OpenAI’s advanced language model. However, the same principles can apply if you decide to use an on-premise LLM of your choice. Here are the core steps we’ll be following:

  • Capture the user’s speech input and store it in a suitable audio format
  • Convert the recorded audio input into text using a transcription service (e.g. Deepgram, orchestrated through Vocode)
  • Send the transcribed text to ChatGPT to receive an intelligent and context-aware response
  • Transform the text-based response back into speech to continue the conversational flow naturally
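Before wiring in the real services, it helps to see the data flow these steps imply. The sketch below uses placeholder functions (they are not Vocode or OpenAI APIs, just stand-ins) so only the shape of one conversational turn is meaningful:

```python
# A schematic of one conversational turn. Each helper is a placeholder
# for the real component introduced later (Deepgram, ChatGPT, ElevenLabs);
# only the data flow between them is meaningful here.

def capture_audio() -> bytes:
    return b"\x00\x01"  # stand-in for raw microphone samples

def transcribe(audio: bytes) -> str:
    return "Hello there"  # stand-in for speech-to-text

def generate_reply(text: str) -> str:
    return f"You said: {text}"  # stand-in for the LLM call

def synthesize(text: str) -> bytes:
    return text.encode()  # stand-in for text-to-speech

def one_turn() -> bytes:
    audio_in = capture_audio()        # step 1: capture
    user_text = transcribe(audio_in)  # step 2: transcribe
    reply_text = generate_reply(user_text)  # step 3: generate
    return synthesize(reply_text)     # step 4: speak

print(one_turn())
```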

In the video below you can see an example output from the code within this article:

If you enjoy this, you might also like the demo from our more in-depth conversation about using voice models with LLMs.

Setting up the Environment

1. Install the Prerequisites

First, you’ll need to install the necessary Python libraries. Open your terminal and run the following commands:

pip install -q 'vocode[io]'
pip install --upgrade vocode==0.1.111
pip install -q openai==0.27.10
pip install -q elevenlabs==1.2.2

These commands will install Vocode, which is the main library for managing voice conversations. It also installs OpenAI for handling conversational flow, and ElevenLabs for text-to-speech synthesis.

2. Get API Keys

In Vocode, which we use to orchestrate our chatbot, a conversation is built from three essential components: a transcriber, an agent and a synthesizer.

Each component has a specific role in managing the flow of a voice conversation and requires API keys to function.

  • Transcriber: The transcriber is responsible for converting spoken language into text. This allows us to understand what the user is saying in real time. There are several transcribers available, including DeepgramTranscriber, GoogleTranscriber, or AssemblyAITranscriber. Each has different parameters and features. To get started, sign up for a Deepgram account and navigate to the API section to obtain your DEEPGRAM_API_KEY.
  • Synthesizer: The synthesizer converts the text responses generated by the agent back into spoken language. It makes the conversation more accessible to the user. There are various synthesizers available, such as AzureSynthesizer and ElevenLabsSynthesizer. Each has its own set of configuration options, allowing you to control aspects such as voice, pitch and speed. To use ElevenLabs, create an account and obtain your ELEVENLABS_API_KEY from the API section after logging in.
  • Agent: The agent acts as the AI layer. It processes the input text, generates relevant responses and manages the conversational flow. Vocode supports different agent configurations, such as ChatGPTAgentConfig, LLMAgentConfig, and InformationRetrievalAgentConfig. For the ChatGPT Agent, you’ll need to register with OpenAI, and generate an API key from the API Keys section.

These components work together to create a smooth and interactive voice experience. Below is an example of how to import them and set their API keys in your code:

import os
from vocode.streaming.agent.chat_gpt_agent import ChatGPTAgent
from vocode.streaming.transcriber.deepgram_transcriber import DeepgramTranscriber
from vocode.streaming.synthesizer.eleven_labs_synthesizer import ElevenLabsSynthesizer

os.environ['DEEPGRAM_API_KEY'] = "90..."
os.environ['OPENAI_API_KEY'] = "sk-.."
os.environ['ELEVENLABS_API_KEY'] = "a57..."

Building the Chatbot

Overall, building a voice-based chatbot from scratch involves several key stages, starting with capturing audio input, transcribing the speech to text, sending the text for processing to an AI model such as ChatGPT, and then providing a response back to the user.

Let’s take a closer look at each of these steps.

Step 1: Capturing Audio

The first step in our chatbot’s pipeline is to capture audio input from the user’s microphone. For this, we’ll define a MicrophoneInput class that leverages open-source libraries like sounddevice and janus to handle audio streams effectively.

MicrophoneInput is designed to interface directly with your microphone, capturing live audio data that can then be processed in real time. By initializing an audio stream from the default input device with a specified sampling rate and chunk size, this class ensures that the audio data is ready for immediate use or further processing.

One of the key features of this method is its ability to save the captured audio into a WAV file on your local system. This not only facilitates live audio processing but also creates a tangible file that can be used for testing and debugging purposes.
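Because the class writes a standard WAV file, you can sanity-check a capture afterwards with Python’s built-in wave module. The snippet below writes a short dummy file with the same format MicrophoneInput uses (mono, 16-bit, 44.1 kHz) so it is self-contained, then reads its parameters back:

```python
import wave

# Write one second of 16-bit mono silence at 44.1 kHz, mimicking the
# format MicrophoneInput produces (LINEAR16, single channel).
with wave.open("debug_capture.wav", "wb") as wf:
    wf.setnchannels(1)       # mono, as in MicrophoneInput
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00" * 44100)  # 44100 frames of silence

# Read the parameters back -- a valid capture should report one
# channel and the configured sampling rate.
with wave.open("debug_capture.wav", "rb") as wf:
    channels, rate, frames = wf.getnchannels(), wf.getframerate(), wf.getnframes()

print(channels, rate, frames)
```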

Next, we send the captured audio to a transcriber, which converts speech-to-text.

import typing
import wave
from typing import Optional

import janus
import numpy as np
import sounddevice as sd


class MicrophoneInput(BaseInputDevice):
    def __init__(self, device_info: dict, sampling_rate: Optional[int] = None,
        chunk_size: int = 2048, microphone_gain: int = 1,
        output_file: Optional[str] = None):

        self.device_info = device_info
        super().__init__(
            sampling_rate or typing.cast(int, self.device_info.get(
                "default_samplerate", 44100)),
            AudioEncoding.LINEAR16, chunk_size)

        self.stream = sd.InputStream(dtype=np.int16, channels=1,
            samplerate=self.sampling_rate, blocksize=self.chunk_size,
            device=int(self.device_info["index"]),
            callback=self._stream_callback)
        self.stream.start()
        self.queue: janus.Queue[bytes] = janus.Queue()
        self.microphone_gain = microphone_gain
        self.output_file = output_file

        if self.output_file:
            self.wave_file = wave.open(self.output_file, 'wb')
            self.wave_file.setnchannels(1)
            self.wave_file.setsampwidth(2)
            self.wave_file.setframerate(self.sampling_rate)

    def _stream_callback(self, in_data: np.ndarray, *_args):
        if self.microphone_gain > 1:
            in_data = in_data * (2 ** self.microphone_gain)
        else:
            in_data = in_data // (2 ** self.microphone_gain)
        audio_bytes = in_data.tobytes()
        self.queue.sync_q.put_nowait(audio_bytes)

        if self.output_file:
            self.wave_file.writeframes(audio_bytes)

    async def get_audio(self) -> bytes:
        return await self.queue.async_q.get()

    def close(self):
        if self.output_file:
            self.wave_file.close()

    @classmethod
    def from_default_device(cls, sampling_rate: Optional[int] = None, output_file: Optional[str] = None):
        return cls(sd.query_devices(kind="input"), sampling_rate, output_file=output_file)

Step 2: Transcribing Audio to Text

Once we have the audio file, the next step is to transcribe it into text. We’ll use Vocode’s StreamingConversation and DeepgramTranscriber components for this task:

  • DeepgramTranscriber converts spoken language into text using Deepgram’s API. It establishes a WebSocket connection to the Deepgram service, sending audio chunks and receiving transcriptions in return.

    The DeepgramTranscriber can handle various audio encodings and sampling rates, and can optionally apply downsampling to reduce bandwidth usage. It processes each piece of audio and determines if the transcription is complete based on certain criteria, such as punctuation (configured through PunctuationEndpointingConfig) or time intervals. The transcriber can output transcriptions as either interim or final results, ensuring accurate and real-time speech recognition tailored to various use cases.

    transcriber=DeepgramTranscriber(
      DeepgramTranscriberConfig.from_input_device(
        microphone_input, endpointing_config=PunctuationEndpointingConfig()
      )
    )
    

    One of the notable features of this setup is its support for multiple languages, such as Spanish. By configuring the DeepgramTranscriber with appropriate language models, the system can transcribe speech in different languages, broadening its applicability and making it accessible to a diverse user base. This multilingual capability is particularly beneficial in creating inclusive voice-based applications that cater to users across different linguistic backgrounds.

    def get_deepgram_url(self):
        # "encoding" is derived earlier in the full implementation from
        # self.transcriber_config.audio_encoding (e.g. "linear16")
        url_params = {
            "encoding": encoding,
            "sample_rate": self.transcriber_config.sampling_rate,
            "channels": 1,
            "interim_results": "true",
        }
        extra_params = {}
        if self.transcriber_config.language:
            extra_params["language"] = self.transcriber_config.language
        url_params.update(extra_params)
        return f"wss://api.deepgram.com/v1/listen?{urlencode(url_params)}"
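The URL construction above is just urlencode over a parameter dict, so it is easy to see how the language setting changes the request. Here is a self-contained version using the same parameter names as the snippet, with "es" requesting Spanish transcription (the "linear16" encoding value is an assumption matching the audio format used elsewhere in this article):

```python
from urllib.parse import urlencode

# Rebuild the query string the transcriber sends to Deepgram, with a
# language code added to request Spanish transcription.
url_params = {
    "encoding": "linear16",
    "sample_rate": 44100,
    "channels": 1,
    "interim_results": "true",
    "language": "es",  # request Spanish instead of the default
}
url = f"wss://api.deepgram.com/v1/listen?{urlencode(url_params)}"
print(url)
```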
    
  • StreamingConversation is a core component that manages the real-time interaction between different elements involved in a voice conversation. It orchestrates the flow from capturing audio input through a microphone, transcribing the audio to text, processing the text with an AI agent, and finally converting the AI-generated text back to speech to be output through speakers. This class abstracts much of the complexity behind handling interactions while Vocode manages the asynchronous streaming of data, response generation, interruptions and inaccuracies.

    In a typical setup, the StreamingConversation integrates a Transcriber to convert spoken language into text. The text is then fed into an Agent, such as a ChatGPTAgent, which processes the text to generate responses. These responses are passed to a Synthesizer, which converts the text back into speech.

    conversation = StreamingConversation(
            output_device=speaker_output,
            transcriber=DeepgramTranscriber(
                DeepgramTranscriberConfig.from_input_device(
                    microphone_input,
                    endpointing_config=PunctuationEndpointingConfig()
                )),
            agent=ChatGPTAgent(
                ChatGPTAgentConfig(
                    initial_message=BaseMessage(text="Hello!"),
                    prompt_preamble="Have a pleasant conversation about life",
                )),
            synthesizer=ElevenLabsSynthesizer(
                elevenlabs_config, aiohttp_session=session),
            logger=logger
        )
    

Step 3: ChatGPTAgent Integration

In the next step, we integrate OpenAI’s ChatGPT to handle the transcribed text and generate responses. ChatGPTAgent utilizes the Chat Completions API to process user inputs and deliver appropriate replies. This API allows developers to send a request containing a list of messages and an API key, receiving a model-generated message in response.

def create_first_response(self, first_prompt):
    messages = (
        [{"role": "system", "content": self.agent_config.prompt_preamble}]
        if self.agent_config.prompt_preamble
        else []
    ) + [{"role": "user", "content": first_prompt}]

    parameters = self.get_chat_parameters(messages)
    return openai.ChatCompletion.create(**parameters)

The Chat Completions API operates by taking a list of messages as input, and returning a model-generated message as output. This setup facilitates both single-turn tasks and multi-turn conversations. For example, a conversation might start with a system message setting the assistant’s behavior, followed by alternating user and assistant messages. The system message is critical as it sets the assistant’s behavior and personality, such as instructing the assistant to keep responses concise or to adopt a particular tone.
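Concretely, the input to the Chat Completions API is just a list of role/content dicts: a system message first, then alternating user and assistant turns. This snippet builds such a list as plain data (no API call is made, and the message contents are only illustrative):

```python
# The shape of a Chat Completions request body's "messages" field:
# a system message sets behaviour, then user/assistant turns alternate.
messages = [
    {"role": "system",
     "content": "You are a concise assistant. Keep replies under 30 words."},
    {"role": "user", "content": "What is a transcriber?"},
    {"role": "assistant",
     "content": "A component that converts spoken audio into text."},
    {"role": "user", "content": "And a synthesizer?"},
]

roles = [m["role"] for m in messages]
print(roles)
```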

agent = ChatGPTAgent(
  ChatGPTAgentConfig(
    initial_message=BaseMessage(text="Ahoy! I be Alex, yer trusty pirate companion!"),
    prompt_preamble="Speak like a pirate and engage in a hearty conversation about the pirate life with Alex the chatbot.",
  )
)

In this example, initial_message sets the first message that the AI will say, establishing the starting point of the conversation, while prompt_preamble provides context and instructions to shape the AI’s responses. This can include directives such as keeping responses brief or focusing on specific topics. For instance, you could instruct the assistant to keep responses under 30 words, ensuring concise communication. In our example, we instruct the bot to lead a conversation as a pirate character.

Step 4: Converting Text to Audio

Finally, we convert the text responses generated by ChatGPT into audio. This is achieved using the ElevenLabsSynthesizer class from Vocode, which interfaces with ElevenLabs’ text-to-speech API. By doing this, the chatbot can deliver vocal responses, enhancing the user interaction experience.

To start, we initialize the audio input and output devices using Vocode’s helper functions. The create_streaming_microphone_input_and_speaker_output function configures the microphone and speaker for audio capture and playback. This setup ensures that the chatbot can both listen and respond to the user. The function takes various parameters, including whether to use default devices and logger information, and it returns the initialized microphone and speaker objects.

async with aiohttp.ClientSession() as session:
    microphone_input, speaker_output = create_streaming_microphone_input_and_speaker_output(
        use_default_devices=True,
        logger=logger,
        output_file='output_audio.wav'
    )
    print("Microphone and speaker initialized")

The above code snippet demonstrates how to initialize the microphone and speaker. By setting use_default_devices=True, the system automatically selects the default input and output devices. The output_file parameter specifies where to save the captured audio, providing a persistent record of the interaction.

Next, we use the ElevenLabsSynthesizer to convert the text generated by the ChatGPT agent into speech. The synthesizer is initialized with configuration settings that include API keys and voice preferences:

synthesizer = ElevenLabsSynthesizer(elevenlabs_config, aiohttp_session=session)

Once initialized, the synthesizer can process text messages and generate corresponding audio.

Full Code and Results

The full code is shown below:

import os, aiohttp, asyncio, signal, logging
from vocode.streaming.streaming_conversation import StreamingConversation
from vocode.helpers import create_streaming_microphone_input_and_speaker_output
from vocode.streaming.transcriber import *
from vocode.streaming.models.transcriber import *
from vocode.streaming.transcriber.deepgram_transcriber import DeepgramTranscriber
from vocode.streaming.agent import *
from vocode.streaming.agent.chat_gpt_agent import ChatGPTAgent
from vocode.streaming.synthesizer import *
from vocode.streaming.models.synthesizer import ElevenLabsSynthesizerConfig
from vocode.streaming.models.agent import *
from vocode.streaming.synthesizer.eleven_labs_synthesizer import ElevenLabsSynthesizer
from vocode.streaming.models.message import BaseMessage

os.environ['DEEPGRAM_API_KEY'] = "90…"
os.environ['OPENAI_API_KEY'] = "sk-..."
os.environ['ELEVENLABS_API_KEY'] = "a57…"

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

async def main():
    assert os.getenv('DEEPGRAM_API_KEY'), "Missing DEEPGRAM_API_KEY"
    assert os.getenv('OPENAI_API_KEY'), "Missing OPENAI_API_KEY"
    assert os.getenv('ELEVENLABS_API_KEY'), "Missing ELEVENLABS_API_KEY"

    async with aiohttp.ClientSession() as session:
        microphone_input, speaker_output = create_streaming_microphone_input_and_speaker_output(
            use_default_devices=True,
            logger=logger,
            output_file='output_audio.wav'
        )
        print("Microphone and speaker initialized")

        elevenlabs_config = ElevenLabsSynthesizerConfig.from_output_device(
            output_device=speaker_output,
            api_key=os.getenv('ELEVENLABS_API_KEY')
        )

        conversation = StreamingConversation(
            output_device=speaker_output,
            transcriber=DeepgramTranscriber(
                DeepgramTranscriberConfig.from_input_device(
                    microphone_input, endpointing_config=PunctuationEndpointingConfig()
                )),
            agent=ChatGPTAgent(
                ChatGPTAgentConfig(
                    initial_message=BaseMessage(text="Ahoy! I be Alex, yer trusty pirate companion!"),
                    prompt_preamble="Speak like a pirate and engage in a hearty conversation about the pirate life with Alex the chatbot.",
                )),
            synthesizer=ElevenLabsSynthesizer(elevenlabs_config, aiohttp_session=session),
            logger=logger,
        )

        async def shutdown(signal, loop):
            await conversation.terminate()
            loop.stop()

        loop = asyncio.get_running_loop()

        def signal_handler(sig, frame):
            loop.call_soon_threadsafe(asyncio.create_task, shutdown(sig, loop))

        for s in (signal.SIGINT, signal.SIGTERM):
            signal.signal(s, signal_handler)

        await conversation.start()
        print("Conversation started, press Ctrl+C to end")

        while conversation.is_active():
            chunk = await microphone_input.get_audio()
            result = conversation.receive_audio(chunk)
            if asyncio.iscoroutine(result):
                await result
            elif result is not None:
                print(f"Unexpected result from receive_audio: {result}")

if __name__ == "__main__":
    loop = asyncio.ProactorEventLoop()  # Windows-specific; use asyncio.new_event_loop() on other platforms
    asyncio.set_event_loop(loop)
    loop.run_until_complete(main())

Once you run it, you can start talking to the chatbot and get a conversation like this:

INFO:__main__:Using microphone input device: Gruppo microfoni (Intel(r) Smart
INFO:__main__:Using speaker output device: Speaker (Realtek(R) Audio)
Microphone and speaker initialized
Conversation started, press Ctrl+C to end
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Synthesizing speech for message
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: Ahoy! I be Alex, yer trusty pirate companion!
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Human started speaking
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Got transcription:  Hi, Alex. What do ye like to do?, confidence: 0.986084
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Synthesizing speech for message
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: Arr, me heart be yearnin' for adventures on the high seas!
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: Sailin' on a pirate ship, discoverin' hidden treasure, and swashbucklin' me way through dangerous encounters, that be what I love to do!
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: But tell me, laddie, what be yer favorite thing about the pirate life?
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Human started speaking
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Got transcription:  I like to explore new places. What's the best treasure ye ever found?, confidence: 0.0
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Synthesizing speech for message
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: Ah, explorin' new places be a grand adventure indeed!
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: As a pirate, I've stumbled upon many wondrous sights.
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: One of the finest treasures I've discovered be a hidden cove with crystal-clear waters, a sandy beach, and a secret cave filled with gold doubloons.
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: 'Twas a sight to behold, I tell ye!
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: But as a true pirate, ye must keep yer treasures concealed and guarded.
DEBUG:__main__:[gFVPf8zv01Bc3-13st4dXA] Message sent: So, the location of me best finds be a secret only shared among the brethren of the seas.

Conclusion

As demonstrated in our example, creating a voice-based chatbot is an exciting journey filled with both challenges and opportunities. You may encounter issues such as audio quality concerns, response delays, inaccurate transcriptions, or unexpected terminations. However, with the right setup, configurations, and perseverance, these challenges can be overcome.

The technology behind chatbots has the potential to revolutionize our interactions with machines, making them more natural and intuitive. Chatbots can be used for a variety of purposes, including customer support, service, education, and smart home integration, as well as professional services. With continued refinement and enhancement, the possibilities are endless, limited only by our creativity and innovation.
