Building a Real-time Microsoft Teams Voice Bot with Gemini Flash

This document outlines an architecture and reference material for building a real-time, voice-to-voice bot for Microsoft Teams on Google's Gemini Multimodal Live API (introduced with Gemini 2.0 Flash).

Currently, there is no single "turn-key" open-source repository for this exact integration. Instead, developers combine three core technologies:

1. The AI Engine: Gemini Multimodal Live API

The Gemini Multimodal Live API streams raw audio and video in both directions over a WebSocket connection and can return audio responses with sub-second latency.

Reference: Google Gen AI Python SDK
Key Feature: Native support for real-time voice input/output and contextual understanding.
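
A minimal sketch of one audio turn against the Live API, using the Google Gen AI Python SDK (`google-genai`). The model name, the `GEMINI_API_KEY` environment variable, and the helper names `pcm_mime_type` and `stream_once` are assumptions for illustration; check the SDK documentation for the Live-capable models currently available. The Live API takes 16-bit PCM input (16 kHz here) and returns 24 kHz PCM audio.

```python
import os

INPUT_RATE_HZ = 16_000  # 16-bit mono PCM sent to Gemini
MODEL = "gemini-2.0-flash-live-001"  # assumption: substitute a current Live-capable model


def pcm_mime_type(rate_hz: int) -> str:
    """MIME type string describing raw PCM input at the given sample rate."""
    return f"audio/pcm;rate={rate_hz}"


async def stream_once(pcm_chunks: list[bytes]) -> bytes:
    """Send PCM chunks to Gemini and collect the audio bytes of one reply."""
    # Imported lazily so the helper above works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    config = {"response_modalities": ["AUDIO"]}
    reply = bytearray()
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        for chunk in pcm_chunks:
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type=pcm_mime_type(INPUT_RATE_HZ))
            )
        # receive() yields server messages until the model finishes its turn.
        async for message in session.receive():
            if message.data is not None:
                reply.extend(message.data)
    return bytes(reply)
```

In a real bot the send and receive loops would run concurrently as two tasks, since meeting audio keeps arriving while Gemini is speaking.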

2. The Orchestration Framework: Pipecat

Pipecat is a widely used open-source framework for building real-time multimodal AI agents.

Reference: pipecat-ai/pipecat
Why use it?: It handles the WebSocket communication with Gemini through its Gemini Multimodal Live service (GeminiMultimodalLiveLLMService; class names have changed between releases, so check the version you install) and provides media transports, voice activity detection (VAD), and handling of user interruptions.
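
To make the VAD and interruption handling concrete, here is a simplified sketch of barge-in detection. It is not Pipecat code: Pipecat ships its own VAD components, and this swaps in a plain energy threshold instead. The class name `BargeInDetector`, the threshold of 500, and the three-frame minimum are hypothetical values chosen for illustration.

```python
import array
import math
import sys


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    samples = array.array("h")
    samples.frombytes(frame)
    if sys.byteorder == "big":
        samples.byteswap()  # PCM on the wire is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class BargeInDetector:
    """Flags an interruption when the user speaks while the bot is talking."""

    def __init__(self, threshold: float = 500.0, min_speech_frames: int = 3):
        self.threshold = threshold
        self.min_speech_frames = min_speech_frames
        self._run = 0  # consecutive frames at or above the threshold

    def update(self, frame: bytes, bot_speaking: bool) -> bool:
        """Feed one user frame; return True if the bot should stop speaking."""
        if frame_rms(frame) >= self.threshold:
            self._run += 1
        else:
            self._run = 0
        return bot_speaking and self._run >= self.min_speech_frames
```

Requiring several consecutive loud frames avoids cutting the bot off on a single click or cough; with 20 ms frames, three frames is 60 ms of sustained speech.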

3. The Teams Connector (Media Extraction)

To get raw audio from a Teams meeting into your AI pipeline, you have two primary options:

Option A: Azure Communication Services (ACS) - Modern & Cross-Platform

This is the recommended approach for Python/Node.js developers.

How it works: Use the ACS Call Automation SDK to have a bot join the meeting. ACS can stream the raw audio over WebSockets directly to your Pipecat/Gemini server.
Reference: Azure Communication Call Automation (Python SDK)
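
A minimal sketch of decoding the messages ACS sends over the media-streaming WebSocket. It assumes the JSON shape described in the ACS audio streaming documentation (a `kind` field, with base64 PCM under `audioData.data`); verify the field names against the current documentation before relying on them. `AudioFrame` and `parse_acs_message` are hypothetical names.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioFrame:
    pcm: bytes                     # raw PCM bytes decoded from base64
    participant_id: Optional[str]  # speaker identifier, if provided
    silent: bool                   # True when the frame is flagged as silence


def parse_acs_message(raw: str) -> Optional[AudioFrame]:
    """Return an AudioFrame for audio messages, None for anything else."""
    msg = json.loads(raw)
    if msg.get("kind") != "AudioData":
        return None  # e.g. the initial metadata message describing the stream
    body = msg.get("audioData") or {}
    return AudioFrame(
        pcm=base64.b64decode(body.get("data", "")),
        participant_id=body.get("participantRawID"),
        silent=bool(body.get("silent", False)),
    )
```

Frames returned by the parser can be forwarded to the AI pipeline as they arrive; skipping frames flagged `silent` is one way to reduce upstream traffic.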

Option B: Microsoft Graph Communications SDK - Legacy/High Performance

This provides raw media access but requires a C#/.NET environment hosted on Windows Server.

How it works: An Application-hosted Media Bot receives raw PCM audio frames directly via the .NET SDK.
Reference: microsoftgraph/microsoft-graph-comms-samples (see the AudioVideoPlaybackBot example under LocalMediaSamples).

High-Level Architecture Workflow

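Bot Registration: Register a bot in Azure. For the Graph path (Option B), grant the Calls.AccessMedia.All application permission; joining a meeting also requires Calls.JoinGroupCall.All.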
Meeting Join: Bot joins the Teams meeting URL using ACS or the Graph SDK.
Media Stream: Audio (usually 16 kHz, 16-bit PCM chunks) is extracted and streamed to your Python/Node server via WebSockets.
AI Processing: Pipecat receives the stream and forwards it to the Gemini Multimodal Live API.
Response: Gemini generates an audio response stream (24 kHz PCM), which is sent back through Pipecat to the Teams meeting stream, resampled if the connector expects a different rate.
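
The return path usually needs a sample-rate conversion, because Gemini emits 24 kHz audio while the meeting side typically works at 16 kHz. The sketch below uses plain linear interpolation with only the standard library; `resample_pcm16` is a hypothetical helper, and a production bot would more likely use a dedicated resampler with proper low-pass filtering to avoid aliasing.

```python
import array
import sys


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation."""
    src = array.array("h")
    src.frombytes(pcm)
    if not src or src_rate == dst_rate:
        return pcm
    if sys.byteorder == "big":
        src.byteswap()  # normalise to native order for arithmetic

    n_out = len(src) * dst_rate // src_rate
    out = array.array("h", bytes(2 * n_out))  # zero-filled output buffer
    step = src_rate / dst_rate                # source samples per output sample
    last = len(src) - 1
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Weighted average of the two neighbouring source samples.
        out[i] = round(src[lo] * (1 - frac) + src[hi] * frac)

    if sys.byteorder == "big":
        out.byteswap()  # back to little-endian for the wire
    return out.tobytes()
```

For example, a 20 ms frame from Gemini (480 samples at 24 kHz) becomes 320 samples at 16 kHz, which matches the 640-byte frames commonly used on the Teams side.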

oneryalcin/teams-gemini-bot.md
