config reference for mosaicml/llm-foundry by opus-4

LLM Foundry Configuration Reference

This document provides a comprehensive reference for all configuration options available in LLM Foundry YAML files. Configuration files are used for training, fine-tuning, and evaluating large language models.

Table of Contents

  • Configuration Format Types
  • Variables
  • Model Configuration
  • Tokenizer Configuration
  • Data Configuration
  • Optimizer Configuration
  • Scheduler Configuration
  • Training Configuration
  • Distributed Training
  • Logging and Monitoring
  • Callbacks
  • Checkpointing
  • Evaluation
  • System Configuration
  • Example Configurations

Configuration Format Types

LLM Foundry uses several flexible input formats for configuration values. Understanding these formats is essential for properly configuring your training runs.

Duration Formats (str/int)

Many time-based parameters accept either strings with units or raw integers:

| Format | Description | Example |
| --- | --- | --- |
| "{N}ba" | N batches | "1000ba" = 1000 batches |
| "{N}ep" | N epochs | "3ep" = 3 epochs |
| "{X}dur" | Fraction of total duration | "0.1dur" = 10% of max_duration |
| "{N}tok" | N tokens (for datasets) | "10000tok" = 10,000 tokens |
| "{N}sp" | N samples | "5000sp" = 5000 samples |
| int | Raw number (unit depends on context) | 1000 = 1000 batches/steps |

Examples:

max_duration: 10ep               # Train for 10 epochs
eval_interval: 500ba             # Evaluate every 500 batches
save_interval: 0.25dur           # Save 4 times during training
t_warmup: 100ba                  # Warmup for 100 batches

Microbatch Size (str/int/float)

The device_train_microbatch_size parameter has special handling:

| Value | Description |
| --- | --- |
| "auto" | Automatically determine optimal microbatch size |
| int | Fixed microbatch size |
| float | Fraction of device batch size (e.g., 0.5 = half) |

Examples:

device_train_microbatch_size: auto    # Let system optimize
device_train_microbatch_size: 4       # Fixed size of 4
device_train_microbatch_size: 0.25    # 1/4 of device batch size

Packing Ratio (float/str)

For sequence packing in fine-tuning datasets:

| Value | Description |
| --- | --- |
| "auto" | Automatically determine optimal packing ratio |
| float | Specific packing ratio (e.g., 2.5) |
| null | No packing |
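
For example, in a fine-tuning loader the packing ratio sits under the dataset block; a minimal sketch with illustrative values (the dataset name is only an example):

train_loader:
  name: finetuning
  dataset:
    hf_name: tatsu-lab/alpaca     # illustrative dataset
    max_seq_len: 2048
    decoder_only_format: true
    packing_ratio: auto           # or a float such as 2.5; null disables packing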

Flexible Type Options

Some parameters accept multiple types for different behaviors:

logit_scale (float/str)

  • float: Fixed scaling value (e.g., 0.5)
  • "inv_sqrt_d_model": Scale by 1/sqrt(d_model)

init_div_is_residual (bool/float/str/int)

  • true/false: Enable/disable residual scaling
  • float: Custom scaling factor
  • String/int values for special initialization schemes
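
Both fields appear directly in the model config; a minimal sketch with illustrative values:

model:
  name: mpt_causal_lm
  logit_scale: inv_sqrt_d_model     # or a fixed float such as 0.5
  init_config:
    init_div_is_residual: true      # or a custom float scaling factor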

fc_type (str/dict)

  • str: Simple type name (e.g., "torch")
  • dict: Detailed configuration:
    fc_type:
      name: torch
      kwargs:
        bias: true

Variable Interpolation

Use ${variables.key} syntax to reference values defined in the variables section:

variables:
  base_lr: 3e-4
  data_path: /datasets/my_data

optimizer:
  lr: ${variables.base_lr}        # Resolves to 3e-4

train_loader:
  dataset:
    local: ${variables.data_path}  # Resolves to /datasets/my_data

Environment Variables

Reference environment variables with optional defaults:

run_name: ${RUN_NAME:default-run}     # Use $RUN_NAME or "default-run"
data_path: ${DATA_PATH}               # Use $DATA_PATH (error if not set)

Checkpoint Filename Templates

The save_filename and save_latest_filename parameters support template variables:

| Variable | Description | Example Value |
| --- | --- | --- |
| {epoch} | Current epoch number | 2 |
| {batch} | Current batch number | 1000 |
| {rank} | Process rank | 0 |
| {timestamp} | Unix timestamp | 1609459200 |

Examples:

save_filename: "ep{epoch}-ba{batch}-rank{rank}.pt"      # Default format
save_filename: "checkpoint_{timestamp}.pt"               # Timestamp-based
save_latest_filename: "latest-rank{rank}.pt"             # Latest checkpoint

Variables

Variables allow you to define reusable values that can be interpolated throughout your configuration using ${variables.key} syntax.

variables:
  data_local: /path/to/local/data
  data_remote: s3://bucket/path
  max_seq_len: 2048
  global_seed: 42
  run_name: ${RUN_NAME:my-training-run}  # Can use env vars with defaults

| Key | Type | Description |
| --- | --- | --- |
| data_local | str | Local path for datasets |
| data_remote | str | Remote path for datasets (S3, OCI, etc.) |
| max_seq_len | int | Maximum sequence length used across configs |
| global_seed | int | Global random seed |
| run_name | str | Name for the training run |
| custom_vars | any | Any custom variables for interpolation |

Model Configuration

Base Model Options

model:
  name: mpt_causal_lm  # or hf_causal_lm, hf_t5
  init_device: meta    # meta, cpu, cuda, or mixed

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Model type: mpt_causal_lm, hf_causal_lm, hf_t5 |
| init_device | str | cpu | Device for model initialization (meta, cpu, cuda, mixed) |

MPT Model Options

For name: mpt_causal_lm:

model:
  name: mpt_causal_lm
  d_model: 2048
  n_heads: 16
  n_layers: 24
  expansion_ratio: 4
  max_seq_len: 2048
  vocab_size: 50432

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| d_model | int | required | Model dimension |
| n_heads | int | required | Number of attention heads |
| n_layers | int | required | Number of transformer layers |
| expansion_ratio | float | 4 | FFN expansion ratio |
| max_seq_len | int | 2048 | Maximum sequence length |
| vocab_size | int | 50432 | Vocabulary size |
| resid_pdrop | float | 0.0 | Residual dropout probability |
| emb_pdrop | float | 0.0 | Embedding dropout probability |
| learned_pos_emb | bool | True | Use learned positional embeddings |
| tie_word_embeddings | bool | True | Tie input/output embeddings |
| logit_scale | float/str | None | Logit scaling (see Flexible Type Options) |
| no_bias | bool | False | Disable all biases |
| attention_bias | bool | True | Use bias in attention projections |
| embedding_fraction | float | 1.0 | Fraction for embedding gradient scaling |
| norm_type | str | low_precision_layernorm | Normalization type |
| norm_eps | float | 1e-5 | Normalization epsilon |
| use_cache | bool | False | Enable KV caching |
| use_pad_tok_in_ffn | bool | True | Forward pad tokens through FFN |
| final_logit_softcapping | float | None | Logit softcapping value |

HuggingFace Model Options

For name: hf_causal_lm:

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
  config_overrides:
    hidden_size: 4096
    intermediate_size: 11008
  use_auth_token: true
  trust_remote_code: true
  use_flash_attention_2: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| pretrained_model_name_or_path | str | required | HF model name or path |
| pretrained | bool | True | Load pretrained weights |
| pretrained_lora_id_or_path | str | None | Path to pretrained LoRA weights |
| config_overrides | dict | {} | Override model config values |
| use_auth_token | bool | False | Use HF auth token |
| trust_remote_code | bool | False | Trust remote code |
| use_flash_attention_2 | bool | False | Use Flash Attention 2 |
| attn_implementation | str | None | Override attention implementation |
| load_in_8bit | bool | False | Load in 8-bit (eval only) |
| allow_embedding_resizing | bool | False | Allow resizing embeddings |

Attention Configuration

model:
  attn_config:
    attn_type: multihead_attention
    attn_impl: flash
    attn_pdrop: 0.0
    qk_ln: false
    clip_qkv: null
    softmax_scale: null

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| attn_type | str | multihead_attention | Attention type: multihead_attention, multiquery_attention, grouped_query_attention |
| attn_impl | str | flash | Implementation: torch, flash, triton |
| attn_pdrop | float | 0.0 | Attention dropout |
| qk_ln | bool | False | Apply LayerNorm to queries/keys |
| qk_gn | bool | False | Apply GroupNorm to queries/keys |
| clip_qkv | float | None | Clip QKV values |
| softmax_scale | float | None | Softmax temperature scaling |
| fused_qkv | bool | True | Fuse QKV projections |
| attn_uses_sequence_id | bool | False | Use sequence IDs (for packing) |
| sliding_window_size | int | -1 | Sliding window attention size |
| alibi | bool | False | Use ALiBi positional bias |
| alibi_bias_max | int | 8 | Maximum ALiBi bias |
| rope | bool | False | Use RoPE |
| rope_theta | int | 10000 | RoPE theta parameter |
| rope_impl | str | hf | RoPE implementation: hf or dail |
| rope_dail_config | dict | {} | DAIL RoPE configuration |
| rope_hf_config | dict | {} | HF RoPE scaling configuration |
| kv_n_heads | int | None | Number of KV heads (for GQA) |
| reuse_kv_layer_idx | int | None | Layer index for KV cache reuse |
| attn_temperature_tuning | dict | None | Temperature tuning with floor_scale and attn_scale |
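
For example, grouped-query attention with RoPE can be configured using the keys above; a sketch with illustrative values:

model:
  attn_config:
    attn_type: grouped_query_attention
    attn_impl: flash
    kv_n_heads: 8            # number of KV heads for GQA
    rope: true
    rope_theta: 10000
    rope_impl: dail          # or hf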

FFN Configuration

model:
  ffn_config:
    ffn_type: mptmlp

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| ffn_type | str | mptmlp | FFN type: mptmlp, mptglu, te_ln_mlp, mb_dmoe |
| moe_num_experts | int | 1 | Number of MoE experts |
| moe_top_k | int | 1 | Top-k experts to use |
| moe_loss_weight | float | 0.01 | MoE auxiliary loss weight |
| uniform_expert_assignment | bool | False | Use uniform expert assignment |
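
A sketch of a mixture-of-experts FFN configuration using the keys above (values are illustrative):

model:
  ffn_config:
    ffn_type: mb_dmoe
    moe_num_experts: 8
    moe_top_k: 2
    moe_loss_weight: 0.01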

Initialization

model:
  init_config:
    name: default_
    init_std: 0.02
    init_div_is_residual: true
    emb_init_std: null
    fan_mode: fan_in

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | default_ | Init method: default_, baseline_, kaiming_uniform_, kaiming_normal_, xavier_uniform_, xavier_normal_ |
| init_std | float | 0.02 | Standard deviation for init |
| init_div_is_residual | bool/float/str/int | True | Residual scaling (see Flexible Type Options) |
| emb_init_std | float | None | Embedding init std |
| emb_init_uniform_lim | float/list | None | Uniform init limits for embeddings |
| init_gain | float | 0 | Gain for init |
| fan_mode | str | fan_in | Fan mode: fan_in, fan_out, fan_avg |
| init_nonlinearity | str | relu | Nonlinearity for init calculations |

PEFT/LoRA Configuration

model:
  peft_config:
    peft_type: LORA
    task_type: CAUSAL_LM
    r: 16
    lora_alpha: 32
    lora_dropout: 0.1
    target_modules:
      - q_proj
      - v_proj

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| peft_type | str | required | PEFT type (currently only LORA) |
| task_type | str | required | Task type: CAUSAL_LM |
| r | int | required | LoRA rank |
| lora_alpha | float | required | LoRA alpha (scaling parameter) |
| lora_dropout | float | 0.0 | LoRA dropout |
| target_modules | list | required | List of module names to apply LoRA to |
| should_save_peft_only | bool | True | Save only PEFT weights |

Advanced Model Options

Block Overrides

For custom architectures and per-layer configurations:

model:
  block_overrides:
    order: ['encoder_block', 'decoder_block']  # Custom block order
    overrides:
      layer_0:
        d_model: 1024
        n_heads: 8

FC Layer Type

model:
  fc_type: torch  # or 'te' for TransformerEngine
  # Or as a dict:
  fc_type:
    name: torch
    kwargs:
      bias: true

Activation Checkpointing Target (MPT Models)

For fine-grained control over which modules to checkpoint:

# Option 1: Single module name
model:
  activation_checkpointing_target: grouped_query_attention

# Option 2: List of modules
model:
  activation_checkpointing_target: ['grouped_query_attention', 'mptmlp']

# Option 3: Dict with layer-specific targeting
model:
  activation_checkpointing_target:
    mptblock: 'last-16'                        # Checkpoint last 16 blocks
    grouped_query_attention: 'first-8, last-8' # First and last 8 layers
    mptmlp: 'middle-8'                         # Middle 8 layers
    norm: 'range-0-16'                         # Range of layers

Supported checkpoint targets:

  • mptblock - Entire transformer blocks
  • grouped_query_attention, multihead_attention, multiquery_attention - Attention layers
  • mptmlp, mptglu - FFN layers
  • norm_attn_norm - Fused norm-attention-norm blocks
  • te_ln_mlp - TransformerEngine MLP

Tokenizer Configuration

tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: 2048
    padding_side: left
    trust_remote_code: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Tokenizer name or path |
| kwargs | dict | {} | Additional tokenizer arguments |
| kwargs.model_max_length | int | None | Maximum sequence length |
| kwargs.padding_side | str | left | Padding side: left or right |
| kwargs.trust_remote_code | bool | False | Trust remote code |

Data Configuration

Training Data

train_loader:
  name: text
  dataset:
    local: ${variables.data_local}/train
    remote: ${variables.data_remote}/train
    split: train
    shuffle: true
    max_seq_len: ${variables.max_seq_len}
    shuffle_seed: ${variables.global_seed}
  drop_last: true
  num_workers: 8
  pin_memory: true
  prefetch_factor: 2
  persistent_workers: true

Evaluation Data

eval_loader:
  name: text
  dataset:
    local: ${variables.data_local}/val
    remote: ${variables.data_remote}/val
    split: val
    shuffle: false
    max_seq_len: ${variables.max_seq_len}
  drop_last: false
  num_workers: 8

# Multiple eval loaders
eval_loaders:
  - label: validation
    # ... loader config
  - label: test
    # ... loader config

Dataset Types

Text Dataset

For pretraining on streaming text data:

dataset:
  local: /path/to/data
  remote: s3://bucket/data
  split: train
  shuffle: true
  max_seq_len: 2048
  shuffle_seed: 42
  eos_token_id: 0

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| local | str | required | Local dataset path |
| remote | str | None | Remote dataset path |
| split | str | required | Dataset split |
| shuffle | bool | False | Shuffle dataset |
| max_seq_len | int | required | Maximum sequence length |
| shuffle_seed | int | 42 | Shuffle seed |
| eos_token_id | int | None | EOS token ID |

Finetuning Dataset

For instruction tuning and finetuning:

dataset:
  hf_name: tatsu-lab/alpaca
  split: train
  max_seq_len: 2048
  decoder_only_format: true
  allow_pad_trimming: false
  packing_ratio: 2.5
  shuffle: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| hf_name | str | required | HuggingFace dataset name |
| preprocessing_fn | str | None | Custom preprocessing function |
| safe_load | bool | False | Safe load dataset |
| max_seq_len | int | required | Maximum sequence length |
| decoder_only_format | bool | True | Use decoder-only format |
| allow_pad_trimming | bool | False | Allow trimming padded tokens |
| packing_ratio | float/str | None | Packing ratio (see Packing Ratio) |
| prompt_delimiter | str | None | Prompt delimiter |
| response_delimiter | str | None | Response delimiter |
| target_prompts | str | none | Target prompts: none, all, length |
| target_responses | str | last | Target responses: last, all |

Dataloader Options

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| drop_last | bool | True | Drop last incomplete batch |
| num_workers | int | 8 | Number of dataloader workers |
| pin_memory | bool | True | Pin memory for GPU transfer |
| prefetch_factor | int | 2 | Prefetch factor |
| persistent_workers | bool | True | Keep workers alive between epochs |
| timeout | int | 0 | Worker timeout in seconds |

Optimizer Configuration

optimizer:
  name: decoupled_adamw
  lr: 3.0e-4
  betas: [0.9, 0.999]
  eps: 1.0e-8
  weight_decay: 0.0

Available Optimizers

| Optimizer | Description |
| --- | --- |
| decoupled_adamw | AdamW with decoupled weight decay |
| decoupled_lionw | Lion optimizer with weight decay |
| adalr_lion | Lion with adaptive learning rate |
| clip_lion | Lion with gradient clipping |
| no_op | No-op optimizer (for eval) |
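
The other optimizers use the same block structure; for example, with illustrative hyperparameters:

optimizer:
  name: decoupled_lionw
  lr: 1.0e-4
  betas: [0.9, 0.95]
  weight_decay: 0.0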

Common Parameters

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| lr | float | required | Learning rate |
| betas | list[float] | [0.9, 0.999] | Beta parameters |
| eps | float | 1e-8 | Epsilon for numerical stability |
| weight_decay | float | 0.0 | Weight decay coefficient |

Scheduler Configuration

scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

Available Schedulers

| Scheduler | Description |
| --- | --- |
| cosine_with_warmup | Cosine annealing with warmup |
| linear_decay_with_warmup | Linear decay with warmup |
| constant_with_warmup | Constant LR with warmup |
| inv_sqrt_with_warmup | Inverse square root with warmup |

Common Parameters

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| t_warmup | str/int | required | Warmup duration (see Duration Formats) |
| alpha_f | float | 0.0 | Final learning rate multiplier (for cosine/linear schedulers) |

Inverse Square Root Scheduler Parameters

For name: inv_sqrt_with_warmup:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| t_scale | float | 1.0 | Time scaling factor |
| t_cooldown | str/int | 0 | Cooldown duration (see Duration Formats) |
| t_max | str/int | None | Maximum time (see Duration Formats) |
| alpha_f_decay | float | 1.0 | Final decay multiplier |
| alpha_f_cooldown | float | 0.0 | Final cooldown multiplier |
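
A sketch combining these with the common warmup parameter (values are illustrative):

scheduler:
  name: inv_sqrt_with_warmup
  t_warmup: 500ba
  t_scale: 1.0
  t_cooldown: 1000ba
  alpha_f_decay: 1.0
  alpha_f_cooldown: 0.0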

Training Configuration

# Duration and intervals
max_duration: 1ep              # or "1000ba", "100000tok"
eval_interval: 100ba
eval_first: false
eval_subset_num_batches: -1

# Batch sizes
global_train_batch_size: 256
device_eval_batch_size: 8
device_train_microbatch_size: auto

# Training settings
seed: 42
precision: amp_bf16
accumulate_train_batch_on_tokens: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| max_duration | str/int | required | Training duration (see Duration Formats) |
| eval_interval | str/int | 1 | Evaluation interval (see Duration Formats) |
| eval_first | bool | False | Evaluate before training |
| eval_subset_num_batches | int | -1 | Eval subset size (-1 for all) |
| global_train_batch_size | int | required | Global batch size |
| device_train_batch_size | int/float | auto | Per-device train batch size |
| device_eval_batch_size | int | required | Per-device eval batch size |
| device_train_microbatch_size | str/int/float | auto | Microbatch size (see Microbatch Size) |
| seed | int | required | Random seed |
| precision | str | amp_bf16 | Training precision |
| accumulate_train_batch_on_tokens | bool | False | Accumulate gradients by tokens |

Gradient Accumulation

Gradient accumulation is automatically configured based on batch sizes:

# Example: 8 GPUs, want effective batch size of 256
global_train_batch_size: 256        # Total batch size across all GPUs
device_train_microbatch_size: 8     # Microbatch per forward pass
# This results in:
# - device_train_batch_size = 256 / 8 = 32 per GPU
# - gradient_accumulation_steps = 32 / 8 = 4

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| global_train_batch_size | int | required | Total batch size across all devices |
| device_train_microbatch_size | str/int/float | auto | Microbatch size per forward pass |
| accumulate_train_batch_on_tokens | bool | False | Accumulate based on tokens (recommended for variable-length sequences) |

Note: When accumulate_train_batch_on_tokens: true, gradient accumulation ensures consistent token counts across accumulation steps, which is important for models trained on variable-length sequences.

Precision Options

| Value | Description |
| --- | --- |
| amp_bf16 | Automatic mixed precision with bfloat16 |
| amp_fp16 | Automatic mixed precision with float16 |
| bf16 | Pure bfloat16 |
| fp16 | Pure float16 |
| fp32 | Full precision |
| fp8 | FP8 precision (requires hardware support) |

Distributed Training

FSDP Configuration

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: DEFAULT
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  use_orig_params: true
  verbose: false

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| sharding_strategy | str | FULL_SHARD | Sharding strategy: FULL_SHARD, SHARD_GRAD_OP, NO_SHARD |
| mixed_precision | str | DEFAULT | Mixed precision: PURE, FULL, DEFAULT |
| activation_checkpointing | bool | False | Enable activation checkpointing (trade compute for memory) |
| activation_checkpointing_reentrant | bool | False | Use reentrant checkpointing (False recommended for better performance) |
| activation_cpu_offload | bool | False | Offload activations to CPU (further memory savings at a performance cost) |
| limit_all_gathers | bool | True | Limit all-gather operations |
| use_orig_params | bool | True | Use original parameters |
| state_dict_type | str | full | State dict type: full, sharded |
| forward_prefetch | bool | False | Enable forward prefetch |
| backward_prefetch | str | BACKWARD_PRE | Backward prefetch: BACKWARD_PRE, BACKWARD_POST |

Tensor Parallelism

tp_config:
  enabled: true
  tp_size: 2
  # Additional TP-specific configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | False | Enable tensor parallelism |
| tp_size | int | 1 | Tensor parallel size |

Logging and Monitoring

Console Logging

progress_bar: true
log_to_console: true
python_log_level: info
console_log_interval: 10ba
log_config: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| progress_bar | bool | True | Show progress bar |
| log_to_console | bool | True | Log to console |
| python_log_level | str | info | Python log level |
| console_log_interval | str | 1ba | Console log interval (see Duration Formats) |
| log_config | bool | True | Log configuration at start |

Integration Loggers

Weights & Biases

loggers:
  wandb:
    project: my-project
    name: ${variables.run_name}
    group: experiment-1
    tags:
      - llm
      - training

MLflow

loggers:
  mlflow:
    tracking_uri: http://localhost:5000
    experiment_name: llm-training
    run_name: ${variables.run_name}

TensorBoard

loggers:
  tensorboard:
    log_dir: /logs/tensorboard
    flush_interval: 100

Callbacks

Standard Callbacks

callbacks:
  speed_monitor:
    window_size: 100
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
  optimizer_monitor: {}

Checkpointing Callbacks

callbacks:
  hf_checkpointer:
    save_folder: ${variables.save_folder}/hf_checkpoints
    save_interval: 1000ba
    precision: bfloat16
    overwrite: true
    
  mono_checkpoint_saver:
    save_folder: ${variables.save_folder}/checkpoints
    save_interval: 1000ba

Advanced Callbacks

| Callback | Description |
| --- | --- |
| kill_loss_spike | Kill training on loss spikes |
| nan_monitor | Monitor for NaN values |
| scheduled_gc | Scheduled garbage collection |
| layer_freezing | Progressive layer freezing |
| curriculum_learning | Curriculum learning |
| eval_output_logging | Log evaluation outputs |
| system_metrics_monitor | Monitor system metrics |
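
These are enabled the same way as the standard callbacks. The sketch below assumes default arguments; each callback accepts its own keyword options, which are omitted here (consult the repository for the available options):

callbacks:
  nan_monitor: {}
  scheduled_gc: {}
  eval_output_logging: {}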

Checkpointing

# Saving configuration
save_folder: s3://bucket/checkpoints
save_interval: 1000ba
save_num_checkpoints_to_keep: 3
save_overwrite: false
save_weights_only: false

# Loading configuration
load_path: s3://bucket/checkpoints/latest
load_weights_only: false
load_strict_model_weights: true
load_ignore_keys:
  - optimizer
  - schedulers

# Auto-resumption
autoresume: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| save_folder | str | None | Checkpoint save location |
| save_interval | str/int | 1000ba | Save interval (see Duration Formats) |
| save_num_checkpoints_to_keep | int | -1 | Number of checkpoints to keep |
| save_overwrite | bool | False | Overwrite existing checkpoints |
| save_weights_only | bool | False | Save only model weights |
| save_filename | str | ep{epoch}-ba{batch}-rank{rank}.pt | Checkpoint filename template |
| save_latest_filename | str | latest-rank{rank}.pt | Latest checkpoint filename |
| load_path | str | None | Path to load checkpoint |
| load_weights_only | bool | False | Load only weights |
| load_strict_model_weights | bool | True | Strict weight loading |
| autoresume | bool | False | Auto-resume from latest |
| save_planner | dict | None | FSDP save planner configuration |
| load_planner | dict | None | FSDP load planner configuration |

Evaluation

ICL Tasks

icl_tasks:
  - label: lambada
    dataset_uri: eval/local_data/lambada.jsonl
    num_fewshot: [0, 1, 5]
    icl_task_type: language_modeling
    continuation_delimiter: " "

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| label | str | required | Task label |
| dataset_uri | str | required | Dataset location |
| num_fewshot | list[int] | [0] | Few-shot examples |
| icl_task_type | str | required | Task type |
| continuation_delimiter | str | " " | Continuation delimiter |

Eval Gauntlet

eval_gauntlet:
  weighting: uniform
  subtract_random_baseline: true
  rescale_accuracy: true
  categories:
    - name: commonsense_reasoning
      benchmarks:
        - name: hellaswag
          num_fewshot: 10
        - name: winogrande
          num_fewshot: 5

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| weighting | str | uniform | Weighting scheme |
| subtract_random_baseline | bool | True | Subtract random baseline |
| rescale_accuracy | bool | True | Rescale accuracy scores |

System Configuration

# Memory settings
max_split_size_mb: 512
expandable_segments: true
cuda_load_lazy: false

# Distributed settings
dist_timeout: 600

# Code paths
code_paths:
  - /path/to/custom/code

# Compilation
compile_config:
  mode: max-autotune
  fullgraph: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| max_split_size_mb | int | 512 | CUDA memory allocator setting |
| expandable_segments | bool | True | CUDA expandable segments |
| cuda_load_lazy | bool | False | Lazy CUDA loading |
| dist_timeout | float | 600.0 | Distributed timeout (seconds) |
| code_paths | list | [] | Additional code paths |
| compile_config | dict | {} | Torch compile configuration |

Example Configurations

Minimal Pretraining Config

variables:
  max_seq_len: 2048
  global_seed: 42
  data_remote: s3://my-bucket/data

model:
  name: mpt_causal_lm
  d_model: 2048
  n_heads: 16
  n_layers: 24
  expansion_ratio: 4
  max_seq_len: ${variables.max_seq_len}
  vocab_size: 50432

tokenizer:
  name: EleutherAI/gpt-neox-20b

train_loader:
  name: text
  dataset:
    remote: ${variables.data_remote}/train
    split: train
    max_seq_len: ${variables.max_seq_len}
  drop_last: true
  num_workers: 8

optimizer:
  name: decoupled_adamw
  lr: 3e-4

scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba

max_duration: 10000ba
global_train_batch_size: 256
device_eval_batch_size: 16
seed: ${variables.global_seed}

Fine-tuning Config

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
  use_flash_attention_2: true
  
  # LoRA configuration
  peft_config:
    peft_type: LORA
    task_type: CAUSAL_LM
    r: 16
    lora_alpha: 32
    lora_dropout: 0.1
    target_modules:
      - q_proj
      - v_proj
      - k_proj
      - o_proj

train_loader:
  name: finetuning
  dataset:
    hf_name: tatsu-lab/alpaca
    split: train
    max_seq_len: 2048
    decoder_only_format: true
    packing_ratio: auto

optimizer:
  name: decoupled_adamw
  lr: 1e-4

max_duration: 3ep

Memory-Optimized Training Config

For training large models with limited GPU memory:

# 30B model on 8x A100 40GB
global_train_batch_size: 128
device_train_microbatch_size: 2  # Small microbatch for memory
accumulate_train_batch_on_tokens: true

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false  # Set true if still OOM
  limit_all_gathers: true

model:
  # Fine-grained checkpointing for 30B model
  activation_checkpointing_target:
    mptblock: 'last-24'  # Checkpoint last 24 of 32 blocks
    grouped_query_attention: 'all'  # Checkpoint all attention

Maximum Memory Savings Config

For extreme memory constraints:

# 70B model training
global_train_batch_size: 16
device_train_microbatch_size: 1
accumulate_train_batch_on_tokens: true

fsdp_config:
  sharding_strategy: FULL_SHARD
  state_dict_type: sharded  # Save memory during checkpointing
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true  # Maximum memory savings
  limit_all_gathers: true
  use_orig_params: true

model:
  activation_checkpointing_target: 'all'  # Checkpoint everything

This configuration reference covers all major options available in LLM Foundry. For the most up-to-date information, refer to the source code and example configurations in the repository.
