@am17an
Created November 30, 2025 10:42
SoL (speed-of-light) calculation for gpt-oss on a 5090
Model stats:
- Total parameters: 21B
- Active parameters per token: 3.6B
- Experts: 32 total, 4 active per layer
- Layers: 24
- Expert precision: 4-bit (0.5 bytes per parameter)
- Dense precision: BF16 (2 bytes per parameter)
1. Expert size:
Each expert is approximately 26 million parameters.
2. Active expert parameters per token:
24 layers * 4 experts per layer * 26M parameters
= 96 * 26M
= 2.496B
≈ 2.5B parameters
3. Dense parameters (attention + embeddings):
Total active = 3.6B
Expert portion = 2.5B
Dense parameters = 3.6B - 2.5B = 1.1B
4. Convert parameters to bytes:
Sparse (MoE) part, 4-bit precision:
2.5B parameters * 0.5 bytes = 1.25 GB
Dense part, BF16 precision:
1.1B parameters * 2 bytes = 2.20 GB
5. Total payload per token:
1.25 GB + 2.20 GB = 3.45 GB
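Steps 1 to 5 can be reproduced with a few lines of Python. This is a minimal sketch that only uses the figures quoted above; the constant and function names are illustrative, and 1 GB is assumed to mean 10^9 bytes. It does not round the expert parameter count to 2.5B, so the result differs from the hand calculation in the third decimal place.

```python
# Figures quoted in the model stats and steps above.
LAYERS = 24
ACTIVE_EXPERTS_PER_LAYER = 4
PARAMS_PER_EXPERT = 26e6          # approximate, per step 1
TOTAL_ACTIVE_PARAMS = 3.6e9
BYTES_PER_EXPERT_PARAM = 0.5      # 4-bit
BYTES_PER_DENSE_PARAM = 2.0       # BF16


def payload_bytes_per_token():
    """Bytes of weights read per generated token (illustrative helper)."""
    expert_params = LAYERS * ACTIVE_EXPERTS_PER_LAYER * PARAMS_PER_EXPERT
    dense_params = TOTAL_ACTIVE_PARAMS - expert_params
    expert_bytes = expert_params * BYTES_PER_EXPERT_PARAM
    dense_bytes = dense_params * BYTES_PER_DENSE_PARAM
    return expert_bytes + dense_bytes


if __name__ == "__main__":
    # Assumes 1 GB = 1e9 bytes.
    print(f"{payload_bytes_per_token() / 1e9:.3f} GB per token")
```

Because the dense weights cost four times as many bytes per parameter as the experts, they account for most of the payload even though they are the smaller share of the active parameters.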
6. Upper bound on tokens per second (assuming 1800 GB/s memory bandwidth for the 5090 and that each token reads every active weight once):
1800 GB/s / 3.45 GB per token ≈ 522 tokens per second
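The final division can be wrapped in a small helper so the bound is easy to recompute for a different bandwidth or payload. This is a sketch under the same assumptions as the text (decoding is purely memory-bandwidth bound and each token reads the full 3.45 GB once); `tokens_per_second_bound` and `fraction_of_bound` are hypothetical names, not part of any library.

```python
def tokens_per_second_bound(bandwidth_gb_s, payload_gb_per_token):
    """Speed-of-light decode rate: bytes available per second / bytes needed per token."""
    if payload_gb_per_token <= 0:
        raise ValueError("payload must be positive")
    return bandwidth_gb_s / payload_gb_per_token


def fraction_of_bound(measured_tps, bandwidth_gb_s=1800.0, payload_gb_per_token=3.45):
    """How close a measured tokens/s figure is to the bound (1.0 = at the bound)."""
    return measured_tps / tokens_per_second_bound(bandwidth_gb_s, payload_gb_per_token)


if __name__ == "__main__":
    # 1800 GB/s and 3.45 GB per token are the figures from the steps above.
    bound = tokens_per_second_bound(1800.0, 3.45)
    print(f"upper bound: {bound:.1f} tokens/s")
```

Dividing a measured decode rate by this bound gives a rough efficiency figure; the bound ignores KV-cache reads, activations, and kernel launch overhead, so real throughput will land below it.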