This document outlines the architectural path and references for building a real-time, voice-to-voice bot for Microsoft Teams using Google's Gemini Multimodal Live API (introduced with Gemini 2.0 Flash).
Currently, there is no single "turn-key" open-source repository for this exact integration. Instead, developers combine three core technologies:
The Gemini Multimodal Live API allows for streaming raw audio and video over WebSockets, providing sub-second audio responses.
- Reference: Google Gen AI Python SDK
- Key Feature: Native support for real-time voice input and output, with contextual understanding across the conversation.
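Streaming raw audio in practice means cutting a continuous PCM byte stream into small, fixed-duration frames before each one goes over the socket. The following is a minimal sketch of that framing step only, assuming 16-bit mono PCM at 16 kHz and 20 ms frames. The helper names `frame_size_bytes` and `chunk_pcm` and the frame duration are illustrative assumptions, not values taken from the Gemini documentation.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumed input sample rate
BYTES_PER_SAMPLE = 2     # 16-bit PCM
FRAME_MS = 20            # assumed frame duration (illustrative)


def frame_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    frame_ms: int = FRAME_MS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> int:
    """Number of bytes in one mono PCM frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample


def chunk_pcm(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield fixed-size frames; a trailing partial frame is zero-padded."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            # Pad with silence so every frame has the same length.
            frame += b"\x00" * (frame_bytes - len(frame))
        yield frame
```

Each yielded frame would then be handed to whatever client sends audio to the model. Zero-padding the last frame is one possible choice; dropping or buffering the remainder would work equally well.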
Pipecat is currently the leading open-source framework for building real-time multimodal AI agents.
- Reference: pipecat-ai/pipecat
- Why use it?: It handles the complex WebSocket communication with Gemini (via its `GeminiMultimodalLiveLLMService`) and provides transports for various media sources, voice activity detection (VAD), and handling of user interruptions.
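To make the VAD and interruption ideas concrete, here is a minimal energy-based detector in plain Python. It is a simplified stand-in and not Pipecat's analyzer: the class name `EnergyVAD`, the RMS threshold, and the hangover length are all hypothetical values chosen for illustration, and the input is assumed to be little-endian 16-bit mono PCM.

```python
import math
import struct


def rms_int16(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class EnergyVAD:
    """Marks speech when frame energy crosses a threshold.

    Speech is considered finished only after `hangover_frames`
    consecutive quiet frames, which avoids cutting off short pauses.
    """

    def __init__(self, threshold: float = 500.0, hangover_frames: int = 10):
        self.threshold = threshold            # assumed value, tune per setup
        self.hangover_frames = hangover_frames
        self.speaking = False
        self._quiet_run = 0

    def process(self, frame: bytes) -> bool:
        """Feed one frame; return True while the user is considered speaking."""
        if rms_int16(frame) >= self.threshold:
            self._quiet_run = 0
            self.speaking = True
        elif self.speaking:
            self._quiet_run += 1
            if self._quiet_run >= self.hangover_frames:
                self.speaking = False
        return self.speaking


def should_interrupt(bot_is_speaking: bool, user_is_speaking: bool) -> bool:
    """Barge-in rule: stop bot playback when the user talks over it."""
    return bot_is_speaking and user_is_speaking
```

A production pipeline would normally rely on the framework's own VAD rather than a raw energy threshold, which is easily fooled by background noise.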
To get raw audio from a Teams meeting into your AI pipeline, you have two primary options:
The first option, Azure Communication Services (ACS) Call Automation, is the recommended approach for Python/Node.js developers.
- How it works: Use the ACS Call Automation SDK to have a bot join the meeting. ACS can stream the raw audio over WebSockets directly to your Pipecat/Gemini server.
- Reference: Azure Communication Call Automation (Python SDK)
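On the receiving side, the server has to unpack each WebSocket message into raw PCM before passing it on. The sketch below assumes every message is a JSON envelope with a `kind` field and base64-encoded audio under `audioData.data`, plus a `silent` flag. These field names are assumptions about the ACS streaming payload and should be checked against the current ACS documentation; `extract_audio` is a hypothetical helper name.

```python
import base64
import json
from typing import Optional


def extract_audio(message: str) -> Optional[bytes]:
    """Return raw PCM bytes from one streaming message, or None.

    Assumed envelope (verify against the ACS docs):
      {"kind": "AudioData",
       "audioData": {"data": "<base64 PCM>", "silent": false}}
    Any other message kind (for example stream metadata) is ignored.
    """
    payload = json.loads(message)
    if payload.get("kind") != "AudioData":
        return None
    audio = payload.get("audioData") or {}
    if audio.get("silent"):
        # Skip frames the sender already flagged as silence.
        return None
    data = audio.get("data")
    return base64.b64decode(data) if data else None
```

Returning `None` for non-audio and silent messages keeps the caller's loop simple: it forwards whatever bytes come back and skips the rest.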
The second option, the Microsoft Graph Communications SDK, provides raw media access but requires a C#/.NET environment hosted on Windows Server.
- How it works: An application-hosted media bot receives raw PCM audio frames directly via the .NET SDK.
- Reference: microsoftgraph/microsoft-graph-comms-samples (look for the `AudioVideoPlaybackBot` example or `LocalMediaSamples`).
- Bot Registration: Register a bot in Azure and grant it the `Calls.AccessMedia.All` permission.
- Meeting Join: The bot joins the Teams meeting URL using ACS or the Graph SDK.
- Media Stream: Audio (usually 16 kHz PCM chunks) is extracted and streamed to your Python/Node.js server via WebSockets.
- AI Processing: Pipecat receives the stream and forwards it to the Gemini Multimodal Live API.
- Response: Gemini generates an audio response stream, which is sent back through Pipecat to the Teams meeting stream.
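The media, AI processing, and response steps above can be sketched as a small asyncio relay. This is a structural illustration only: the queues stand in for the meeting's WebSocket streams, `echo_model` is a hypothetical placeholder for the Gemini session, and none of the names come from Pipecat, ACS, or the Gemini SDK.

```python
import asyncio
from typing import Awaitable, Callable, List

Frame = bytes
ModelFn = Callable[[Frame], Awaitable[Frame]]


async def relay(
    incoming: asyncio.Queue, outgoing: asyncio.Queue, model: ModelFn
) -> None:
    """Forward meeting audio to the model and its replies back out.

    A None item on `incoming` is the end-of-call sentinel and is
    propagated to `outgoing` so the sender can shut down as well.
    """
    while True:
        frame = await incoming.get()
        if frame is None:
            await outgoing.put(None)
            return
        reply = await model(frame)
        await outgoing.put(reply)


async def echo_model(frame: Frame) -> Frame:
    """Placeholder for the model call: returns the audio unchanged."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return frame


async def demo() -> List[Frame]:
    incoming: asyncio.Queue = asyncio.Queue()
    outgoing: asyncio.Queue = asyncio.Queue()
    for frame in (b"frame-1", b"frame-2"):
        incoming.put_nowait(frame)
    incoming.put_nowait(None)  # end of call
    await relay(incoming, outgoing, echo_model)
    replies = []
    while (item := outgoing.get_nowait()) is not None:
        replies.append(item)
    return replies


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A real streaming session does not produce one reply per input frame. Input and output run as two concurrent streams, so a fuller version would use separate send and receive tasks instead of the single loop shown here.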
Great article. Re: Duplex vs Gemini/OpenAI realtime: full-duplex models excel at native bidirectional, same-step audio streaming (better interruption and backchannel timing) because listening and speaking are modeled jointly. Commercial realtime APIs (Gemini/OpenAI) often still use componentized pipelines (ASR/VAD -> LLM -> TTS) with robust features, managed infra, guardrails, tool use, and multimodal integrations. The current tradeoff: open duplex models may be more latency-sensitive and lower-level to run, while platform realtime models are usually easier to ship, scale, and productionize today. Happy to do a small architecture comparison matrix if useful.