@oneryalcin
Created May 13, 2026 07:19
Architecture and references for building a real-time Microsoft Teams voice bot using Gemini Flash Multimodal Live API

Building a Real-time Microsoft Teams Voice Bot with Gemini Flash

This document outlines the architecture and references for building a real-time, voice-to-voice bot for Microsoft Teams using Google's Gemini Multimodal Live API (Gemini 2.0 Flash).

Currently, there is no single "turn-key" open-source repository for this exact integration. Instead, developers combine three core technologies:

1. The AI Engine: Gemini Multimodal Live API

The Gemini Multimodal Live API streams raw audio and video over a WebSocket connection and returns audio responses with sub-second latency.

  • Reference: Google Gen AI Python SDK
  • Key Feature: Native real-time voice input and output with contextual understanding.
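A minimal sketch of one audio round trip through the Google Gen AI Python SDK (`google-genai`) might look like the following. The model name, the `GEMINI_API_KEY` environment variable, and the helper names `pcm_mime_type` and `stream_once` are assumptions made for illustration; check the SDK documentation for current model identifiers before relying on it.

```python
import asyncio
import os

# Assumption: a Live-capable model identifier; verify against the current model list.
MODEL = "gemini-2.0-flash-live-001"


def pcm_mime_type(sample_rate_hz: int) -> str:
    """Build the MIME type string used to label raw PCM audio chunks."""
    return f"audio/pcm;rate={sample_rate_hz}"


async def stream_once(pcm_chunks: list[bytes]) -> bytes:
    """Send 16 kHz, 16-bit mono PCM chunks and collect one turn of audio output.

    The SDK is imported lazily so this module can be loaded without it installed.
    """
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    config = {"response_modalities": ["AUDIO"]}

    async with client.aio.live.connect(model=MODEL, config=config) as session:
        for chunk in pcm_chunks:
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type=pcm_mime_type(16_000))
            )
        reply = bytearray()
        # receive() yields server messages for the current turn; audio bytes
        # are exposed on .data when present.
        async for message in session.receive():
            if message.data:
                reply.extend(message.data)
        return bytes(reply)
```

Calling `asyncio.run(stream_once(chunks))` needs network access and a valid API key. Because turn detection is handled server-side, the input should end with a short stretch of silence; otherwise the model may keep waiting for more speech.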

2. The Orchestration Framework: Pipecat

Pipecat is an open-source framework for building real-time multimodal AI agents.

  • Reference: pipecat-ai/pipecat
  • Why use it?: It handles the complex WebSocket communication with Gemini (via its GeminiMultimodalLiveLLMService) and provides transports for various media channels, voice activity detection (VAD), and handling of user interruptions.
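To make the VAD and interruption handling concrete, here is a toy sketch of the idea. This is not Pipecat's implementation (a production framework typically uses a trained VAD model); it is a simple energy threshold, and the names `rms`, `is_speech`, and `InterruptionDetector`, along with the threshold values, are hypothetical.

```python
import math
import struct


def rms(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    usable = pcm[: len(pcm) - len(pcm) % 2]  # drop a trailing odd byte, if any
    samples = [s[0] for s in struct.iter_unpack("<h", usable)]
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its energy reaches the (assumed) threshold."""
    return rms(frame) >= threshold


class InterruptionDetector:
    """Signals an interruption after N consecutive speech frames while the bot talks."""

    def __init__(self, min_speech_frames: int = 3, threshold: float = 500.0) -> None:
        self.min_speech_frames = min_speech_frames
        self.threshold = threshold
        self._run = 0  # current streak of speech frames

    def update(self, frame: bytes, bot_speaking: bool) -> bool:
        if bot_speaking and is_speech(frame, self.threshold):
            self._run += 1
        else:
            self._run = 0
        return self._run >= self.min_speech_frames
```

Requiring several consecutive speech frames avoids cutting off the bot's audio on a single noisy frame, at the cost of a few frames of extra reaction time.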

3. The Teams Connector (Media Extraction)

To get raw audio from a Teams meeting into your AI pipeline, you have two primary options:

Option A: Azure Communication Services (ACS) - Modern & Cross-Platform

This is the recommended approach for Python/Node.js developers.

Option B: Microsoft Graph Communications SDK - Legacy/High Performance

This provides raw media access but requires a C#/.NET environment hosted on Windows Server.

  • How it works: An Application-hosted Media Bot receives raw PCM audio frames directly via the .NET SDK.
  • Reference: microsoftgraph/microsoft-graph-comms-samples (Look for the AudioVideoPlaybackBot example or LocalMediaSamples).

High-Level Architecture Workflow

  1. Bot Registration: Register a bot in Azure and grant it the Calls.AccessMedia.All application permission.
  2. Meeting Join: The bot joins the Teams meeting via its join URL using ACS or the Graph SDK.
  3. Media Stream: Audio (usually 16 kHz PCM chunks) is extracted and streamed to your Python/Node server via WebSockets.
  4. AI Processing: Pipecat receives the stream and forwards it to the Gemini Multimodal Live API.
  5. Response: Gemini generates an audio response stream, which is sent back through Pipecat to the Teams meeting stream.
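Steps 3 to 5 can be sketched as a small framing-and-bridging loop. This is an illustration under stated assumptions, not a working Teams integration: audio is assumed to be 16-bit mono at 16 kHz, the 20 ms frame size is an assumption, and `echo_model` is a hypothetical stand-in for the Pipecat/Gemini stage.

```python
import asyncio

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM
FRAME_MS = 20         # assumption: 20 ms frames
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000


def split_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> list[bytes]:
    """Split PCM into fixed-size frames, zero-padding the final partial frame."""
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        frames.append(frame.ljust(frame_bytes, b"\x00"))
    return frames


async def echo_model(frame: bytes) -> bytes:
    """Hypothetical AI stage: returns the input frame unchanged."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return frame


async def bridge(inbound: asyncio.Queue, outbound: asyncio.Queue, model=echo_model) -> None:
    """Forward frames from the meeting to the model and its replies back out."""
    while True:
        frame = await inbound.get()
        if frame is None:  # sentinel: the call has ended
            await outbound.put(None)
            return
        await outbound.put(await model(frame))


async def demo() -> list[bytes]:
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    for frame in split_frames(bytes(1000)):
        inbound.put_nowait(frame)
    inbound.put_nowait(None)
    await bridge(inbound, outbound)
    replies = []
    while (item := outbound.get_nowait()) is not None:
        replies.append(item)
    return replies
```

In a real deployment the inbound queue would be fed by the ACS or Graph media connection and the outbound queue drained back into the meeting. The two queues decouple the transport from the model, so either side can be swapped independently.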
@oneryalcin

Author

Great article. Re: Duplex vs Gemini/OpenAI realtime: full-duplex models excel at native bidirectional, same-step audio streaming (better interruption and backchannel timing) because listening and speaking are modeled jointly. Commercial realtime APIs (Gemini/OpenAI) often still use componentized pipelines (ASR/VAD -> LLM -> TTS) with robust features, managed infrastructure, guardrails, tool use, and multimodal integrations. The current tradeoff: open duplex models may be more latency-dense and lower-level to run, while platform realtime models are usually easier to ship, scale, and productionize today. Happy to do a small architecture comparison matrix if useful.
