Overview
In Pranthora, an agent interaction flows through a carefully designed pipeline that handles both speech-based and text-based conversations with high accuracy and responsiveness. There are two primary types of pipelines:
- 🎙️ Speech Pipeline: used for real-time voice conversations.
- 💬 Text Pipeline: used for chat-based or text-only interactions.
🎙️ Speech Pipeline
The Speech Pipeline powers natural and responsive conversations between users and agents. It processes the entire flow from a user's voice input to an agent's spoken response, following this sequence:
1. User Audio Input
User audio can originate from:
- Twilio (for phone calls)
- Web (for browser-based calls)
2. Voice Activity Detection (VAD)
The incoming audio is analyzed in the Voice Activity Detection (VAD) stage. This stage identifies when the user starts and stops speaking, detecting turn boundaries using a combination of:
- Acoustic cues (voice energy, silence)
- Semantic understanding (end-of-sentence meaning)
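As a rough illustration of the acoustic side of turn detection, the sketch below treats a turn as finished once speech has been followed by a run of low-energy frames. The frame format, `energy_threshold`, and `max_silence_frames` are assumptions made for this example, not Pranthora's actual parameters, and the semantic end-of-sentence check is left out entirely.

```python
from typing import Iterable, List, Optional


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def find_turn_end(
    frames: Iterable[List[float]],
    energy_threshold: float = 0.01,  # hypothetical value
    max_silence_frames: int = 3,     # hypothetical value
) -> Optional[int]:
    """Return the index of the frame where the turn is considered over.

    Returns None if no speech was heard or the user is still speaking.
    """
    speaking = False
    silence_run = 0
    for index, frame in enumerate(frames):
        if frame_energy(frame) >= energy_threshold:
            # Voice energy detected: the turn is (still) in progress.
            speaking = True
            silence_run = 0
        elif speaking:
            # Silence after speech: count it toward the end-of-turn window.
            silence_run += 1
            if silence_run >= max_silence_frames:
                return index
    return None
```

A production detector would typically combine a signal like this with a semantic check on the partial transcript, so that a mid-sentence pause is not mistaken for the end of a turn.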
3. Transcription
Once user speech is detected, it is sent to the transcription model in streaming mode.
- The model generates partial transcripts in near real-time.
- These transcripts are continuously updated and refined as the user speaks.
- This enables the agent to start reasoning and responding without waiting for the user to finish completely.
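One way to picture the partial-transcript flow is a small buffer in which each new partial replaces the previous one, while finalized segments are kept. The `TranscriptBuffer` class and the `(text, is_final)` event shape are hypothetical and chosen only for this sketch; the transcription model's real event format may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptBuffer:
    """Tracks finalized segments plus the latest partial hypothesis."""

    final_segments: List[str] = field(default_factory=list)
    partial: str = ""

    def update(self, text: str, is_final: bool) -> None:
        if is_final:
            # A finalized segment is committed and the partial is cleared.
            self.final_segments.append(text)
            self.partial = ""
        else:
            # A newer partial supersedes the previous hypothesis.
            self.partial = text

    def current_text(self) -> str:
        """Best available transcript so far, usable before the user finishes."""
        parts = list(self.final_segments)
        if self.partial:
            parts.append(self.partial)
        return " ".join(parts)
```

Downstream code can call `current_text()` at any time, which is what lets the agent begin reasoning on an unfinished utterance.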
4. Agent / Assistant Processing
The transcribed text is then passed to the Agent (LLM/Assistant), which:
- Understands user intent.
- Generates a contextual and relevant response.
- Optionally performs external tool or integration calls (e.g., MCP, HTTP, or N8N workflows).
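The optional tool-call step can be sketched as a registry that maps tool names to callables. Everything here is illustrative: `ToolRegistry`, `handle_agent_step`, and the dictionary shape of an agent step are assumptions, and a real MCP, HTTP, or N8N integration would replace the plain Python functions.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[..., Any]


class ToolRegistry:
    """Maps tool names to the callables that implement them."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        self._tools[name] = fn

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**arguments)


def handle_agent_step(step: Dict[str, Any], tools: ToolRegistry) -> str:
    """Turn one (hypothetical) agent step into text for the user."""
    if step.get("type") == "tool_call":
        # The agent asked for an external call; run it and report the result.
        result = tools.call(step["name"], step.get("arguments", {}))
        return f"Tool {step['name']} returned: {result}"
    # Otherwise the step is a plain text reply.
    return step.get("text", "")
```

In practice the tool result would usually be fed back to the LLM so it can phrase the final answer, rather than being returned verbatim as shown here.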
5. Text-to-Speech (TTS)
The agent's text output is sent to the Text-to-Speech (TTS) module, which:
- Converts the text into natural-sounding audio.
- Supports streaming synthesis, so the user hears the response with minimal delay.
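Streaming synthesis generally depends on handing text to the TTS engine in small pieces instead of waiting for the full reply. The sketch below groups a stream of LLM tokens into sentence-sized chunks. The `sentence_chunks` helper and its punctuation rule are assumptions for illustration and say nothing about how Pranthora's TTS module actually segments text.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a chunk each time the buffered text ends a sentence."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            # A sentence is complete: hand it off for synthesis right away.
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left once the token stream ends.
        yield buffer.strip()
```

For example, `list(sentence_chunks(["Your ", "order ", "shipped. ", "Anything ", "else?"]))` produces two chunks, so audio for the first sentence can start playing while the second is still being generated. This naive rule also splits on abbreviations such as "e.g.", which a real segmenter would need to handle.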
6. Assistant Audio Output
Finally, the generated speech is streamed back to the user, either through:
- The browser audio interface, or
- A Twilio-powered phone call.
💬 Text Pipeline
The Text Pipeline is a simplified version of the speech pipeline, perfect for chat testing or text-based interfaces.
- User Text Input: The user types a message directly.
- Agent Processing: The text is passed to the same LLM/assistant logic used in the speech pipeline.
- External Calls: Integrations and tool executions occur as needed.
- Text Output: The agent's response is returned directly as text (no TTS step).
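The relationship between the two pipelines can be sketched as two thin wrappers around one shared agent function. All names below (`run_text_pipeline`, `run_speech_pipeline`, and the `transcribe` and `synthesize` callables) are hypothetical stand-ins, and the speech version is shown as a single non-streaming pass for brevity.

```python
from typing import Callable

Agent = Callable[[str], str]


def run_text_pipeline(message: str, agent: Agent) -> str:
    """Text in, text out: the agent is the only processing stage."""
    return agent(message)


def run_speech_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    agent: Agent,
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Audio in, audio out: the same agent, wrapped by transcription and TTS."""
    text = transcribe(audio)
    reply = agent(text)
    return synthesize(reply)


def echo_agent(text: str) -> str:
    # Stand-in for the real LLM/assistant logic.
    return f"You said: {text}"
```

Because both functions accept the same `agent` callable, a conversation flow exercised through the text pipeline should behave the same way once speech input and output are added around it.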
Summary
| Stage | Speech Pipeline | Text Pipeline |
|---|---|---|
| Input | User voice (Twilio/Web) | Text message |
| Processing | VAD + Transcription + Agent + TTS | Agent only |
| Output | Agent voice (streamed) | Text reply |
| Interruption Handling | Dynamic speech interruption detection | Not applicable |
Key Takeaways
- Both pipelines share the same Agent intelligence and integration logic.
- The speech pipeline adds real-time streaming, turn detection, and TTS playback.
- In the speech pipeline, user interruptions are detected dynamically so the conversation keeps flowing naturally.
- Together, the two pipelines cover both voice-first and text-first conversational experiences.
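Interruption handling in a speech pipeline is often implemented as "barge-in": the agent stops sending audio as soon as the user starts talking. The following is a minimal sketch of that idea under stated assumptions. `play_until_interrupted`, `send`, and `user_is_speaking` are hypothetical names, and Pranthora's actual detection logic is not described in this document.

```python
from typing import Callable, Iterable


def play_until_interrupted(
    chunks: Iterable[bytes],
    send: Callable[[bytes], None],
    user_is_speaking: Callable[[], bool],
) -> bool:
    """Send audio chunks until they run out or the user starts talking.

    Returns True if playback finished, False if it was interrupted.
    """
    for chunk in chunks:
        if user_is_speaking():
            # The user barged in: stop playback and leave the rest unsent.
            return False
        send(chunk)
    return True
```

Here `user_is_speaking` would typically be backed by the VAD stage, and a `False` return value signals the caller to discard any queued agent audio and start listening again.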
