This document provides a comprehensive reference for all configuration options available in LLM Foundry YAML files. Configuration files are used for training, fine-tuning, and evaluating large language models.
LLM Foundry uses several flexible input formats for configuration values. Understanding these formats is essential for properly configuring your training runs.
Duration Formats (str/int)
Many time-based parameters accept either strings with units or raw integers:
| Format | Description | Example |
|--------|-------------|---------|
| "{N}ba" | N batches | "1000ba" = 1000 batches |
| "{N}ep" | N epochs | "3ep" = 3 epochs |
| "{X}dur" | Fraction of total duration | "0.1dur" = 10% of max_duration |
| "{N}tok" | N tokens (for datasets) | "10000tok" = 10,000 tokens |
| "{N}sp" | N samples | "5000sp" = 5000 samples |
| int | Raw number (unit depends on context) | 1000 = 1000 batches/steps |
Examples:
```yaml
max_duration: 10ep        # Train for 10 epochs
eval_interval: 500ba      # Evaluate every 500 batches
save_interval: 0.25dur    # Save 4 times during training
t_warmup: 100ba           # Warmup for 100 batches
```
Microbatch Size (str/int/float)
The device_train_microbatch_size parameter has special handling:
| Value | Description |
|-------|-------------|
| "auto" | Automatically determine optimal microbatch size |
| int | Fixed microbatch size |
| float | Fraction of device batch size (e.g., 0.5 = half) |
Examples:
```yaml
device_train_microbatch_size: auto   # Let the system optimize
device_train_microbatch_size: 4      # Fixed size of 4
device_train_microbatch_size: 0.25   # 1/4 of the device batch size
```
Packing Ratio (float/str)
For sequence packing in fine-tuning datasets:
| Value | Description |
|-------|-------------|
| "auto" | Automatically determine optimal packing ratio |
| float | Specific packing ratio (e.g., 2.5) |
| null | No packing |
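For example, a fine-tuning dataloader might enable packing as follows. This is a minimal sketch; the dataset and loader settings are illustrative, with packing_ratio placed under the dataset config as in typical LLM Foundry fine-tuning examples:

```yaml
train_loader:
  name: finetuning
  dataset:
    hf_name: mosaicml/dolly_hfhub   # Illustrative Hugging Face dataset
    max_seq_len: 2048
    packing_ratio: auto             # Or a float such as 2.5, or null to disable packing
    shuffle: true
  drop_last: true
  num_workers: 8
```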
Flexible Type Options
Some parameters accept multiple types for different behaviors:
logit_scale (float/str)
- float: Fixed scaling value (e.g., 0.5)
- "inv_sqrt_d_model": Scale by 1/sqrt(d_model)
init_div_is_residual (bool/float/str/int)
- true/false: Enable or disable residual scaling
- float: Custom scaling factor
- str/int: Values for special initialization schemes
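As a minimal sketch, both options are set directly under the model config for an MPT model (the values shown are illustrative):

```yaml
model:
  name: mpt_causal_lm
  logit_scale: inv_sqrt_d_model   # Or a fixed float such as 0.5
  init_div_is_residual: true      # Or a custom float scaling factor
```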
fc_type (str/dict)
- str: Simple type name (e.g., "torch")
- dict: Detailed configuration:
```yaml
fc_type:
  name: torch
  kwargs:
    bias: true
```
Variable Interpolation
Use ${variables.key} syntax to reference values defined in the variables section:
```yaml
variables:
  base_lr: 3e-4
  data_path: /datasets/my_data

optimizer:
  lr: ${variables.base_lr}          # Resolves to 3e-4

train_loader:
  dataset:
    local: ${variables.data_path}   # Resolves to /datasets/my_data
```
Environment Variables
Reference environment variables with optional defaults:
```yaml
run_name: ${RUN_NAME:default-run}   # Use $RUN_NAME, or "default-run" if unset
data_path: ${DATA_PATH}             # Use $DATA_PATH (error if not set)
```
Checkpoint Filename Templates
The save_filename and save_latest_filename parameters support template variables:
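A minimal sketch, assuming the Composer-style placeholders {epoch}, {batch}, and {rank} (check your Composer/LLM Foundry version for the full list of supported placeholders):

```yaml
save_filename: 'ep{epoch}-ba{batch}-rank{rank}.pt'   # e.g., resolves to ep2-ba1000-rank0.pt
save_latest_filename: 'latest-rank{rank}.pt'         # Stable name pointing at the newest checkpoint
```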
Variables
Variables allow you to define reusable values that can be interpolated throughout your configuration using ${variables.key} syntax:
```yaml
variables:
  data_local: /path/to/local/data
  data_remote: s3://bucket/path
  max_seq_len: 2048
  global_seed: 42
  run_name: ${RUN_NAME:my-training-run}   # Can use env vars with defaults
```
| Key | Type | Description |
|-----|------|-------------|
| data_local | str | Local path for datasets |
| data_remote | str | Remote path for datasets (S3, OCI, etc.) |
| max_seq_len | int | Maximum sequence length used across configs |
| global_seed | int | Global random seed |
| run_name | str | Name for the training run |
| custom_vars | any | Any custom variables for interpolation |
Model Configuration
Base Model Options
```yaml
model:
  name: mpt_causal_lm   # or hf_causal_lm, hf_t5
  init_device: meta     # meta, cpu, cuda, or mixed
```
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| name | str | required | Model type: mpt_causal_lm, hf_causal_lm, hf_t5 |
| init_device | str | cpu | Device for model initialization (meta, cpu, cuda, mixed) |
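For Hugging Face models, the config also specifies which pretrained checkpoint to load. The sketch below follows common LLM Foundry examples; the checkpoint name is illustrative:

```yaml
model:
  name: hf_causal_lm
  init_device: mixed
  pretrained: true                                          # Load pretrained weights
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf   # Illustrative checkpoint
```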
The fully connected layer type can be set either as a simple string or as a dict with kwargs:

```yaml
model:
  fc_type: torch   # or 'te' for TransformerEngine

  # Or as a dict:
  fc_type:
    name: torch
    kwargs:
      bias: true
```
Activation Checkpointing Target (MPT Models)
For fine-grained control over which modules to checkpoint:
```yaml
model:
  # Option 1: Single module name
  activation_checkpointing_target: 'grouped_query_attention'

  # Option 2: List of modules
  activation_checkpointing_target: ['grouped_query_attention', 'mptmlp']

  # Option 3: Dict with layer-specific targeting
  activation_checkpointing_target:
    mptblock: 'last-16'                          # Checkpoint last 16 blocks
    grouped_query_attention: 'first-8, last-8'   # First and last 8 layers
    mptmlp: 'middle-8'                           # Middle 8 layers
    norm: 'range-0-16'                           # Range of layers
```
Batch Size and Gradient Accumulation
Gradient accumulation is configured automatically based on the batch size settings:
```yaml
# Example: 8 GPUs, want an effective batch size of 256
global_train_batch_size: 256        # Total batch size across all GPUs
device_train_microbatch_size: 8     # Microbatch per forward pass

# This results in:
# - device_train_batch_size = 256 / 8 = 32 per GPU
# - gradient_accumulation_steps = 32 / 8 = 4
```
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| global_train_batch_size | int | required | Total batch size across all devices |
| device_train_microbatch_size | str/int/float | auto | Microbatch size per forward pass |
| accumulate_train_batch_on_tokens | bool | False | Accumulate based on tokens (recommended for variable-length sequences) |
Note: When accumulate_train_batch_on_tokens: true, gradient accumulation ensures consistent token counts across accumulation steps, which is important for models trained on variable-length sequences.
Memory-Constrained Training Config
For training large models with limited GPU memory:
```yaml
# 30B model on 8x A100 40GB
global_train_batch_size: 128
device_train_microbatch_size: 2             # Small microbatch for memory
accumulate_train_batch_on_tokens: true

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false             # Set true if still OOM
  limit_all_gathers: true

model:
  # Fine-grained checkpointing for 30B model
  activation_checkpointing_target:
    mptblock: 'last-24'                     # Checkpoint last 24 of 32 blocks
    grouped_query_attention: 'all'          # Checkpoint all attention
```
Maximum Memory Savings Config
For extreme memory constraints:
```yaml
# 70B model training
global_train_batch_size: 16
device_train_microbatch_size: 1
accumulate_train_batch_on_tokens: true

fsdp_config:
  sharding_strategy: FULL_SHARD
  state_dict_type: sharded                  # Save memory during checkpointing
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true              # Maximum memory savings
  limit_all_gathers: true
  use_orig_params: true

model:
  activation_checkpointing_target: 'all'    # Checkpoint everything
```
This configuration reference covers all major options available in LLM Foundry. For the most up-to-date information, refer to the source code and example configurations in the repository.