Overview
In Pranthora, each agent interaction flows through a pipeline that handles either a speech-based or a text-based conversation, designed for accuracy and responsiveness. There are two primary pipeline types:
- 🎙️ Speech Pipeline: used for real-time voice conversations.
- 💬 Text Pipeline: used for chat-based or text-only interactions.
🎙️ Speech Pipeline
The Speech Pipeline powers natural and responsive conversations between users and agents. It processes the entire flow from a user's voice input to an agent's spoken response, following this sequence:
1. User Audio Input
User audio can originate from:
- Twilio (for phone calls)
- Web (for browser-based calls)
2. Voice Activity Detection (VAD)
The cleaned audio is analyzed by the Voice Activity Detection (VAD) stage. This stage identifies when the user starts and stops speaking, detecting turn boundaries from a combination of:
- Acoustic cues (voice energy, silence)
- Semantic understanding (end-of-sentence meaning)
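The acoustic half of turn detection can be illustrated with a minimal sketch. It is not Pranthora's implementation: the `TurnDetector` class, the RMS threshold, and the silence window are all assumptions chosen for illustration, and the semantic end-of-sentence check is left out entirely.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TurnDetector:
    """Hypothetical energy-based turn detector (illustrative only)."""

    energy_threshold: float = 0.02  # RMS at or above this counts as speech (assumed value)
    end_silence_frames: int = 3     # consecutive quiet frames that close a turn (assumed value)

    def detect(self, frames: Iterable[List[float]]) -> List[Tuple[int, int]]:
        """Return (first_speech_frame, last_speech_frame) for each detected turn."""
        turns: List[Tuple[int, int]] = []
        start = None       # index of the first speech frame in the open turn
        last_speech = -1   # index of the most recent speech frame
        quiet = 0          # consecutive quiet frames seen inside the open turn
        for i, frame in enumerate(frames):
            rms = (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0
            if rms >= self.energy_threshold:
                if start is None:
                    start = i
                last_speech = i
                quiet = 0
            elif start is not None:
                quiet += 1
                if quiet >= self.end_silence_frames:
                    turns.append((start, last_speech))
                    start = None
                    quiet = 0
        if start is not None:  # the audio ended while a turn was still open
            turns.append((start, last_speech))
        return turns
```

A short pause (fewer quiet frames than `end_silence_frames`) stays inside the same turn, which is why the silence window matters: too small and the agent cuts the user off, too large and the response feels slow.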
3. Transcription
Once user speech is detected, it is sent to the transcription model in streaming mode.
- The model generates partial transcripts in near real time.
- These transcripts are continuously updated and refined as the user speaks.
- This enables the agent to start reasoning and responding without waiting for the user to finish completely.
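One way to picture the partial-transcript flow is a small buffer that keeps committed text separate from the still-changing partial. This is only a sketch: the `TranscriptBuffer` name and the `is_final` flag are assumptions, not part of any documented Pranthora or transcription-provider API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptBuffer:
    """Hypothetical holder for streaming transcription results."""

    finalized: List[str] = field(default_factory=list)  # segments that will no longer change
    partial: str = ""                                   # latest hypothesis for the open segment

    def update(self, text: str, is_final: bool) -> str:
        """Apply one streaming result and return the best transcript so far."""
        if is_final:
            self.finalized.append(text.strip())
            self.partial = ""
        else:
            # A newer partial replaces the older one instead of being appended.
            self.partial = text.strip()
        return self.current()

    def current(self) -> str:
        parts = self.finalized + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Because `current()` is available after every update, downstream logic can begin working on the user's words before the final result arrives.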
4. Agent / Assistant Processing
The transcribed text is then passed to the Agent (LLM/Assistant), which:
- Understands user intent.
- Generates a contextual and relevant response.
- Optionally performs external tool or integration calls (e.g., MCP, HTTP, or N8N workflows).
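The agent step, including an optional tool call, can be sketched roughly as below. Everything here is hypothetical: `ToolRegistry`, `run_agent_turn`, and the `fake_llm` stand-in are illustrative names, the decision format is invented for the example, and a real deployment would route tool calls through MCP, HTTP, or N8N rather than a local function.

```python
from typing import Callable, Dict, Optional

ToolFn = Callable[[dict], str]
# Stand-in for the LLM: given the user text and an optional tool result, it returns
# either {"tool": name, "args": {...}} or {"reply": text}. This shape is an assumption.
DecideFn = Callable[[str, Optional[str]], dict]


class ToolRegistry:
    """Hypothetical lookup table of callable integrations."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        self._tools[name] = fn

    def call(self, name: str, args: dict) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](args)


def run_agent_turn(user_text: str, decide: DecideFn, tools: ToolRegistry) -> str:
    """Run one turn, allowing at most one tool call before the final reply."""
    decision = decide(user_text, None)
    if "tool" in decision:
        result = tools.call(decision["tool"], decision.get("args", {}))
        decision = decide(user_text, result)
    return decision.get("reply", "")


def fake_llm(user_text: str, tool_result: Optional[str]) -> dict:
    """Deterministic stub used only to exercise the flow."""
    if tool_result is not None:
        return {"reply": f"The forecast says: {tool_result}"}
    if "weather" in user_text.lower():
        return {"tool": "get_weather", "args": {"city": "Pune"}}
    return {"reply": "How can I help?"}


tools = ToolRegistry()
tools.register("get_weather", lambda args: f"sunny in {args['city']}")
```

Calling `run_agent_turn("What's the weather?", fake_llm, tools)` goes through the tool path, while a greeting such as `"hi"` is answered directly.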
5. Text-to-Speech (TTS)
The agent's text output is sent to the Text-to-Speech (TTS) module, which:
- Converts the text into natural-sounding audio.
- Supports streaming synthesis, so the user hears the response with minimal delay.
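Streaming synthesis usually depends on handing text to the synthesizer in small pieces rather than waiting for the full reply. The sketch below shows one possible chunking strategy, splitting on sentence-ending punctuation as tokens arrive. The `sentence_chunks` helper is an assumption for illustration, not how Pranthora's TTS module is known to work.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that directly follows ., ! or ? (a deliberately naive rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last may still grow.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends
```

Each yielded sentence could be sent to synthesis while the agent is still generating the rest of the reply. The naive rule will also split after abbreviations such as "e.g.", so a production splitter would need more care.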
6. Assistant Audio Output
Finally, the generated speech is streamed back to the user through either:
- The browser audio interface, or
- A Twilio-powered phone call.
💬 Text Pipeline
The Text Pipeline is a simplified version of the speech pipeline, suited to chat testing and text-based interfaces.
- User Text Input: The user types a message directly.
- Agent Processing: The text is passed to the same LLM/assistant logic used in the speech pipeline.
- External Calls: Integrations and tool executions occur as needed.
- Text Output: The agent's response is returned directly as text (no TTS step).
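The relationship between the two pipelines can be sketched as two thin wrappers around one shared agent function. All names below (`run_text_pipeline`, `run_speech_pipeline`, and the stub stages) are hypothetical, and the speech stages are reduced to plain function calls rather than streams.

```python
from typing import Callable

Agent = Callable[[str], str]          # text in, text out
Transcriber = Callable[[bytes], str]  # audio in, text out
Synthesizer = Callable[[str], bytes]  # text in, audio out


def run_text_pipeline(message: str, agent: Agent) -> str:
    """Text pipeline: the message goes straight to the agent, and the reply stays text."""
    return agent(message)


def run_speech_pipeline(
    audio: bytes, transcribe: Transcriber, agent: Agent, synthesize: Synthesizer
) -> bytes:
    """Speech pipeline: the same agent, wrapped by transcription and TTS."""
    text = transcribe(audio)
    reply = agent(text)
    return synthesize(reply)


# Stubs used only to show that both pipelines reach the same agent logic.
def echo_agent(text: str) -> str:
    return f"You said: {text}"


def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

Because `echo_agent` is passed to both functions unchanged, behaviour verified in a chat test carries over to voice, with only the input and output stages differing.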
Summary
| Stage | Speech Pipeline | Text Pipeline |
|---|---|---|
| Input | User voice (Twilio/Web) | Text message |
| Processing | VAD + Transcription + Agent + TTS | Agent only |
| Output | Agent voice (streamed) | Text reply |
| Interruption Handling | Dynamic speech interruption detection | Not applicable |
Key Takeaways
- Both pipelines share the same Agent intelligence and integration logic.
- The speech pipeline adds real-time streaming, turn detection, and TTS playback.
- In the speech pipeline, user interruptions are detected dynamically so the conversation stays smooth and natural.
- Together, the two pipelines suit both voice-first and text-first conversational experiences.
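The source does not describe how interruptions are handled internally. One common approach, shown here purely as an assumed sketch, is to drop any agent audio that has not yet been played as soon as user speech is detected. The `PlaybackController` class and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackController:
    """Hypothetical queue of agent audio chunks that supports user interruption."""

    def __init__(self) -> None:
        self._pending: Deque[bytes] = deque()
        self.interrupted = False

    def start_response(self) -> None:
        """Begin a new agent response, clearing any earlier interruption."""
        self._pending.clear()
        self.interrupted = False

    def enqueue(self, chunk: bytes) -> None:
        """Queue synthesized audio, unless the current response was interrupted."""
        if not self.interrupted:
            self._pending.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when nothing is queued."""
        return self._pending.popleft() if self._pending else None

    def on_user_speech(self) -> int:
        """Handle an interruption: drop unplayed audio and return how many chunks were dropped."""
        dropped = len(self._pending)
        self._pending.clear()
        self.interrupted = True
        return dropped
```

In this sketch, the turn-detection stage would call `on_user_speech()` when it hears the user during playback, and the agent would call `start_response()` before speaking again.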
