Overview

In Pranthora, every agent interaction flows through a pipeline that handles either speech-based or text-based conversation. There are two pipeline types:
  • 🎙️ Speech Pipeline – Used for real-time voice conversations.
  • 💬 Text Pipeline – Used for chat-based or text-only interactions.

🎙️ Speech Pipeline

The Speech Pipeline powers natural and responsive conversations between users and agents.
It processes the entire flow from a user's voice input to an agent's spoken response, following this sequence:

1. User Audio Input

User audio can originate from:
  • Twilio (for phone calls)
  • Web (for browser-based calls)
The incoming audio is resampled to a consistent format and noise cancellation is applied to ensure clarity before further processing.
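
As a rough illustration of this preprocessing step, the sketch below resamples mono audio with linear interpolation and then applies a simple noise gate. Both are stand-ins chosen for brevity: the source does not say which resampler or noise-cancellation method Pranthora uses, and the function names, the 16 kHz target rate, and the gate threshold are assumptions made for this example.

```python
from typing import List


def resample_linear(samples: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample mono audio by linear interpolation (illustrative quality only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # distance between output samples, in input samples
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last input sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def noise_gate(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Zero out samples below the threshold; a crude stand-in for noise cancellation."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def preprocess(samples: List[float], src_rate: int, target_rate: int = 16000) -> List[float]:
    """Bring audio from any source to one sample rate, then suppress low-level noise."""
    return noise_gate(resample_linear(samples, src_rate, target_rate))
```

Telephony audio commonly arrives at 8 kHz while browser audio is often captured at a higher rate, which is why a single target rate simplifies every later stage.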

2. Voice Activity Detection (VAD)

The cleaned audio is analyzed through the Voice Activity Detection (VAD) stage.
This stage identifies when the user starts and stops speaking, detecting turn boundaries using a combination of:
  • Acoustic cues (voice energy, silence)
  • Semantic understanding (end-of-sentence meaning)
This ensures that the system knows precisely when to respond or when to continue listening.
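
One way to combine the two kinds of cues is sketched below: frame energy decides whether the user is speaking, and the running transcript decides how long a pause must last before the turn is considered over. This is a minimal sketch and not Pranthora's actual detector. The class name, thresholds, frame counts, and the punctuation-based completeness check are all assumptions.

```python
import math
from typing import List


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def looks_complete(text: str) -> bool:
    """Very rough semantic cue: the transcript ends with sentence punctuation."""
    return text.rstrip().endswith((".", "?", "!"))


class TurnDetector:
    """Hypothetical turn detector mixing an acoustic cue with a semantic cue."""

    def __init__(self, energy_threshold: float = 0.05,
                 short_pause_frames: int = 3, long_pause_frames: int = 10) -> None:
        self.energy_threshold = energy_threshold
        self.short_pause_frames = short_pause_frames  # used when the sentence looks finished
        self.long_pause_frames = long_pause_frames    # used when it looks unfinished
        self.speaking = False
        self.silent_frames = 0

    def update(self, frame: List[float], partial_transcript: str) -> bool:
        """Feed one frame; return True when the user's turn appears to have ended."""
        if rms(frame) >= self.energy_threshold:
            self.speaking = True
            self.silent_frames = 0
            return False
        if not self.speaking:
            return False  # silence before any speech: keep listening
        self.silent_frames += 1
        needed = (self.short_pause_frames if looks_complete(partial_transcript)
                  else self.long_pause_frames)
        if self.silent_frames >= needed:
            self.speaking = False
            self.silent_frames = 0
            return True
        return False
```

With these settings, a short pause after "What time is it?" ends the turn quickly, while the same pause after "I want to" does not, which lets the system keep listening when the sentence is clearly unfinished.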

3. Transcription

Once user speech is detected, it is sent to the transcription model in streaming mode.
  • The model generates partial transcripts in near real-time.
  • These transcripts are continuously updated and refined as the user speaks.
  • This enables the agent to start reasoning and responding without waiting for the user to finish completely.
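
The streaming behaviour described above can be pictured with a simulated transcriber that emits a growing partial transcript after each word and a final transcript at the end. The generator here is a stand-in for a real speech-to-text model, and the `TranscriptEvent` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool


def fake_streaming_transcriber(words: Iterable[str]) -> Iterator[TranscriptEvent]:
    """Stand-in for a streaming STT model: yields a longer partial after each word."""
    heard: List[str] = []
    for word in words:
        heard.append(word)
        yield TranscriptEvent(" ".join(heard), is_final=False)
    yield TranscriptEvent(" ".join(heard), is_final=True)


def consume(events: Iterable[TranscriptEvent]) -> Tuple[List[str], str]:
    """Collect partials as they arrive and keep the final transcript separately."""
    partials: List[str] = []
    final = ""
    for event in events:
        if event.is_final:
            final = event.text
        else:
            # A real agent could begin reasoning on this partial text here.
            partials.append(event.text)
    return partials, final
```

Because each partial is available as soon as it is produced, downstream stages do not need to wait for the final event before starting work.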

4. Agent / Assistant Processing

The transcribed text is then passed to the Agent (LLM/Assistant), which:
  • Understands user intent.
  • Generates a contextual and relevant response.
  • Optionally performs external tool or integration calls (e.g., MCP, HTTP, or N8N workflows).
During this step, the system keeps listening for user interruptions. If the user begins speaking again, the agent's response is interrupted and cleared, allowing the system to immediately process the new input.
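
Interruption handling of this kind is often built on task cancellation. The sketch below uses Python's `asyncio` to run a simulated response alongside a simulated listener and cancels the response as soon as the listener reports user speech. It is an assumption-laden illustration: `speak`, `listen_for_barge_in`, and `run_turn` are hypothetical names, and the source does not describe how Pranthora implements this internally.

```python
import asyncio
from typing import List, Optional, Tuple


async def speak(chunks: List[str], played: List[str]) -> None:
    """Simulated agent response: emit one chunk per event-loop step."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0)  # yield so the listener task can run


async def listen_for_barge_in(played: List[str], after_chunks: int) -> None:
    """Simulated user who starts speaking once some chunks have been played."""
    while len(played) < after_chunks:
        await asyncio.sleep(0)


async def run_turn(chunks: List[str],
                   barge_in_after: Optional[int] = None) -> Tuple[List[str], bool]:
    """Return (chunks actually played, whether the response was interrupted)."""
    played: List[str] = []
    speaker = asyncio.create_task(speak(chunks, played))
    if barge_in_after is None:
        await speaker
        return played, False
    listener = asyncio.create_task(listen_for_barge_in(played, barge_in_after))
    done, _ = await asyncio.wait({speaker, listener},
                                 return_when=asyncio.FIRST_COMPLETED)
    if listener in done and not speaker.done():
        speaker.cancel()  # drop the rest of the response
        try:
            await speaker
        except asyncio.CancelledError:
            pass
        return played, True
    listener.cancel()
    return played, False
```

Calling `asyncio.run(run_turn(chunks, barge_in_after=2))` returns fewer chunks than were queued along with `True`, showing that the remainder of the response was discarded once the user spoke.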

5. Text-to-Speech (TTS)

The agent's text output is sent to the Text-to-Speech (TTS) module, which:
  • Converts the text into natural-sounding audio.
  • Supports streaming synthesis, so the user hears the response with minimal delay.
The result is a fluid, real-time, back-and-forth voice experience.
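
A common way to get low-latency playback from streamed agent output is to synthesize sentence by sentence instead of waiting for the full response. The sketch below groups streamed tokens into sentences and hands each one to a placeholder synthesizer. Sentence-level chunking is an assumption made for this example, and `fake_synthesize` is hypothetical: it returns placeholder bytes, not real audio.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text tokens into sentences that can be synthesized one at a time."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing text without punctuation


def fake_synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a TTS call; returns placeholder bytes, not audio."""
    return sentence.encode("utf-8")


def stream_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence as soon as that sentence is complete."""
    for sentence in sentence_chunks(tokens):
        yield fake_synthesize(sentence)
```

Since `stream_tts` is a generator, the first audio chunk is available as soon as the first sentence ends, even while later tokens are still being produced.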

6. Assistant Audio Output

Finally, the generated speech is streamed back to the user, either through:
  • The browser audio interface, or
  • A Twilio-powered phone call.
This completes one full speech interaction loop, from user voice → agent reasoning → agent voice.

💬 Text Pipeline

The Text Pipeline is a simplified version of the speech pipeline, suited to chat testing and text-based interfaces.
  1. User Text Input – The user types a message directly.
  2. Agent Processing – The text is passed to the same LLM/assistant logic used in the speech pipeline.
  3. External Calls – Integrations and tool executions occur as needed.
  4. Text Output – The agent's response is returned directly as text (no TTS step).
This pipeline mirrors the logic of the speech pipeline but excludes all voice-related components (audio input, VAD, transcription, and TTS).
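
The relationship between the two pipelines can be sketched as two thin wrappers around one shared agent function. Everything here is illustrative: `make_agent`, the `name: argument` tool convention, and the `stt`/`tts` callables are assumptions for the example and not Pranthora's API.

```python
from typing import Callable, Dict

Agent = Callable[[str], str]
Tool = Callable[[str], str]


def make_agent(tools: Dict[str, Tool]) -> Agent:
    """Build a toy agent: 'name: argument' calls a tool, anything else is echoed."""
    def agent(text: str) -> str:
        name, sep, arg = text.partition(":")
        tool = tools.get(name.strip()) if sep else None
        if tool is not None:
            return tool(arg.strip())  # external call step
        return f"You said: {text}"
    return agent


def text_pipeline(message: str, agent: Agent) -> str:
    """Text in, text out: only the agent stage runs."""
    return agent(message)


def speech_pipeline(audio: bytes,
                    stt: Callable[[bytes], str],
                    agent: Agent,
                    tts: Callable[[str], bytes]) -> bytes:
    """Audio in, audio out: the same agent wrapped by transcription and TTS."""
    return tts(agent(stt(audio)))
```

Passing the same `agent` object to both functions is what keeps behaviour consistent between a chat test and a live call in this sketch.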

Summary

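| Stage                 | Speech Pipeline                       | Text Pipeline  |
|-----------------------|---------------------------------------|----------------|
| Input                 | User voice (Twilio/Web)               | Text message   |
| Processing            | VAD + Transcription + Agent + TTS     | Agent only     |
| Output                | Agent voice (streamed)                | Text reply     |
| Interruption Handling | Dynamic speech interruption detection | Not applicable |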

⚙️ Key Takeaways

  • Both pipelines share the same Agent intelligence and integration logic.
  • The speech pipeline adds real-time streaming, turn detection, and TTS playback.
  • In the speech pipeline, a user interruption stops and clears the agent's in-progress response so the new input can be processed immediately.
  • Together, the two pipelines support both voice-first and text-first conversational experiences.