config reference for mosaicml/llm-foundry by opus-4

LLM Foundry Configuration Reference

This document provides a comprehensive reference for all configuration options available in LLM Foundry YAML files. Configuration files are used for training, fine-tuning, and evaluating large language models.

Table of Contents

  • Configuration Format Types
  • Variables
  • Model Configuration
  • Tokenizer Configuration
  • Data Configuration
  • Optimizer Configuration
  • Scheduler Configuration
  • Training Configuration
  • Distributed Training
  • Logging and Monitoring
  • Callbacks
  • Checkpointing
  • Evaluation
  • System Configuration
  • Example Configurations

Configuration Format Types

LLM Foundry uses several flexible input formats for configuration values. Understanding these formats is essential for properly configuring your training runs.

Duration Formats (str/int)

Many time-based parameters accept either strings with units or raw integers:

| Format | Description | Example |
| --- | --- | --- |
| "{N}ba" | N batches | "1000ba" = 1000 batches |
| "{N}ep" | N epochs | "3ep" = 3 epochs |
| "{X}dur" | Fraction of total duration | "0.1dur" = 10% of max_duration |
| "{N}tok" | N tokens (for datasets) | "10000tok" = 10,000 tokens |
| "{N}sp" | N samples | "5000sp" = 5000 samples |
| int | Raw number (unit depends on context) | 1000 = 1000 batches/steps |

Examples:

max_duration: 10ep               # Train for 10 epochs
eval_interval: 500ba             # Evaluate every 500 batches
save_interval: 0.25dur           # Save 4 times during training
t_warmup: 100ba                  # Warmup for 100 batches

Microbatch Size (str/int/float)

The device_train_microbatch_size parameter has special handling:

| Value | Description |
| --- | --- |
| "auto" | Automatically determine optimal microbatch size |
| int | Fixed microbatch size |
| float | Fraction of device batch size (e.g., 0.5 = half) |

Examples:

device_train_microbatch_size: auto    # Let system optimize
device_train_microbatch_size: 4       # Fixed size of 4
device_train_microbatch_size: 0.25    # 1/4 of device batch size

Packing Ratio (float/str)

For sequence packing in fine-tuning datasets:

| Value | Description |
| --- | --- |
| "auto" | Automatically determine optimal packing ratio |
| float | Specific packing ratio (e.g., 2.5) |
| null | No packing |
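
For example, in a fine-tuning loader the packing ratio sits under the dataset block; a minimal sketch with illustrative values (the dataset name is only an example):

train_loader:
  name: finetuning
  dataset:
    hf_name: tatsu-lab/alpaca     # illustrative dataset
    max_seq_len: 2048
    decoder_only_format: true
    packing_ratio: auto           # or a float such as 2.5; null disables packing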

Flexible Type Options

Some parameters accept multiple types for different behaviors:

logit_scale (float/str)

  • float: Fixed scaling value (e.g., 0.5)
  • "inv_sqrt_d_model": Scale by 1/sqrt(d_model)

init_div_is_residual (bool/float/str/int)

  • true/false: Enable/disable residual scaling
  • float: Custom scaling factor
  • String/int values for special initialization schemes
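
Both fields appear directly in the model config; a minimal sketch with illustrative values:

model:
  name: mpt_causal_lm
  logit_scale: inv_sqrt_d_model     # or a fixed float such as 0.5
  init_config:
    init_div_is_residual: true      # or a custom float scaling factor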

fc_type (str/dict)

  • str: Simple type name (e.g., "torch")
  • dict: Detailed configuration:
    fc_type:
      name: torch
      kwargs:
        bias: true

Variable Interpolation

Use ${variables.key} syntax to reference values defined in the variables section:

variables:
  base_lr: 3e-4
  data_path: /datasets/my_data

optimizer:
  lr: ${variables.base_lr}        # Resolves to 3e-4

train_loader:
  dataset:
    local: ${variables.data_path}  # Resolves to /datasets/my_data

Environment Variables

Reference environment variables with optional defaults:

run_name: ${RUN_NAME:default-run}     # Use $RUN_NAME or "default-run"
data_path: ${DATA_PATH}               # Use $DATA_PATH (error if not set)

Checkpoint Filename Templates

The save_filename and save_latest_filename parameters support template variables:

| Variable | Description | Example Value |
| --- | --- | --- |
| {epoch} | Current epoch number | 2 |
| {batch} | Current batch number | 1000 |
| {rank} | Process rank | 0 |
| {timestamp} | Unix timestamp | 1609459200 |

Examples:

save_filename: "ep{epoch}-ba{batch}-rank{rank}.pt"      # Default format
save_filename: "checkpoint_{timestamp}.pt"               # Timestamp-based
save_latest_filename: "latest-rank{rank}.pt"             # Latest checkpoint

Variables

Variables allow you to define reusable values that can be interpolated throughout your configuration using ${variables.key} syntax.

variables:
  data_local: /path/to/local/data
  data_remote: s3://bucket/path
  max_seq_len: 2048
  global_seed: 42
  run_name: ${RUN_NAME:my-training-run}  # Can use env vars with defaults

| Key | Type | Description |
| --- | --- | --- |
| data_local | str | Local path for datasets |
| data_remote | str | Remote path for datasets (S3, OCI, etc.) |
| max_seq_len | int | Maximum sequence length used across configs |
| global_seed | int | Global random seed |
| run_name | str | Name for the training run |
| custom_vars | any | Any custom variables for interpolation |

Model Configuration

Base Model Options

model:
  name: mpt_causal_lm  # or hf_causal_lm, hf_t5
  init_device: meta    # meta, cpu, cuda, or mixed

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Model type: mpt_causal_lm, hf_causal_lm, hf_t5 |
| init_device | str | cpu | Device for model initialization (meta, cpu, cuda, mixed) |

MPT Model Options

For name: mpt_causal_lm:

model:
  name: mpt_causal_lm
  d_model: 2048
  n_heads: 16
  n_layers: 24
  expansion_ratio: 4
  max_seq_len: 2048
  vocab_size: 50432

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| d_model | int | required | Model dimension |
| n_heads | int | required | Number of attention heads |
| n_layers | int | required | Number of transformer layers |
| expansion_ratio | float | 4 | FFN expansion ratio |
| max_seq_len | int | 2048 | Maximum sequence length |
| vocab_size | int | 50432 | Vocabulary size |
| resid_pdrop | float | 0.0 | Residual dropout probability |
| emb_pdrop | float | 0.0 | Embedding dropout probability |
| learned_pos_emb | bool | True | Use learned positional embeddings |
| tie_word_embeddings | bool | True | Tie input/output embeddings |
| logit_scale | float/str | None | Logit scaling (see Flexible Type Options) |
| no_bias | bool | False | Disable all biases |
| attention_bias | bool | True | Use bias in attention projections |
| embedding_fraction | float | 1.0 | Fraction for embedding gradient scaling |
| norm_type | str | low_precision_layernorm | Normalization type |
| norm_eps | float | 1e-5 | Normalization epsilon |
| use_cache | bool | False | Enable KV caching |
| use_pad_tok_in_ffn | bool | True | Forward pad tokens through FFN |
| final_logit_softcapping | float | None | Logit softcapping value |

HuggingFace Model Options

For name: hf_causal_lm:

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
  config_overrides:
    hidden_size: 4096
    intermediate_size: 11008
  use_auth_token: true
  trust_remote_code: true
  use_flash_attention_2: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| pretrained_model_name_or_path | str | required | HF model name or path |
| pretrained | bool | True | Load pretrained weights |
| pretrained_lora_id_or_path | str | None | Path to pretrained LoRA weights |
| config_overrides | dict | {} | Override model config values |
| use_auth_token | bool | False | Use HF auth token |
| trust_remote_code | bool | False | Trust remote code |
| use_flash_attention_2 | bool | False | Use Flash Attention 2 |
| attn_implementation | str | None | Override attention implementation |
| load_in_8bit | bool | False | Load in 8-bit (eval only) |
| allow_embedding_resizing | bool | False | Allow resizing embeddings |

Attention Configuration

model:
  attn_config:
    attn_type: multihead_attention
    attn_impl: flash
    attn_pdrop: 0.0
    qk_ln: false
    clip_qkv: null
    softmax_scale: null

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| attn_type | str | multihead_attention | Attention type: multihead_attention, multiquery_attention, grouped_query_attention |
| attn_impl | str | flash | Implementation: torch, flash, triton |
| attn_pdrop | float | 0.0 | Attention dropout |
| qk_ln | bool | False | Apply LayerNorm to queries/keys |
| qk_gn | bool | False | Apply GroupNorm to queries/keys |
| clip_qkv | float | None | Clip QKV values |
| softmax_scale | float | None | Softmax temperature scaling |
| fused_qkv | bool | True | Fuse QKV projections |
| attn_uses_sequence_id | bool | False | Use sequence IDs (for packing) |
| sliding_window_size | int | -1 | Sliding window attention size |
| alibi | bool | False | Use ALiBi positional bias |
| alibi_bias_max | int | 8 | Maximum ALiBi bias |
| rope | bool | False | Use RoPE |
| rope_theta | int | 10000 | RoPE theta parameter |
| rope_impl | str | hf | RoPE implementation: hf or dail |
| rope_dail_config | dict | {} | DAIL RoPE configuration |
| rope_hf_config | dict | {} | HF RoPE scaling configuration |
| kv_n_heads | int | None | Number of KV heads (for GQA) |
| reuse_kv_layer_idx | int | None | Layer index for KV cache reuse |
| attn_temperature_tuning | dict | None | Temperature tuning with floor_scale and attn_scale |
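
For example, grouped-query attention with RoPE can be configured using the keys above; a sketch with illustrative values:

model:
  attn_config:
    attn_type: grouped_query_attention
    attn_impl: flash
    kv_n_heads: 8            # number of KV heads for GQA
    rope: true
    rope_theta: 10000
    rope_impl: dail          # or hf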

FFN Configuration

model:
  ffn_config:
    ffn_type: mptmlp

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| ffn_type | str | mptmlp | FFN type: mptmlp, mptglu, te_ln_mlp, mb_dmoe |
| moe_num_experts | int | 1 | Number of MoE experts |
| moe_top_k | int | 1 | Top-k experts to use |
| moe_loss_weight | float | 0.01 | MoE auxiliary loss weight |
| uniform_expert_assignment | bool | False | Use uniform expert assignment |
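
A sketch of a mixture-of-experts FFN configuration using the keys above (values are illustrative):

model:
  ffn_config:
    ffn_type: mb_dmoe
    moe_num_experts: 8
    moe_top_k: 2
    moe_loss_weight: 0.01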

Initialization

model:
  init_config:
    name: default_
    init_std: 0.02
    init_div_is_residual: true
    emb_init_std: null
    fan_mode: fan_in

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | default_ | Init method: default_, baseline_, kaiming_uniform_, kaiming_normal_, xavier_uniform_, xavier_normal_ |
| init_std | float | 0.02 | Standard deviation for init |
| init_div_is_residual | bool/float/str/int | True | Residual scaling (see Flexible Type Options) |
| emb_init_std | float | None | Embedding init std |
| emb_init_uniform_lim | float/list | None | Uniform init limits for embeddings |
| init_gain | float | 0 | Gain for init |
| fan_mode | str | fan_in | Fan mode: fan_in, fan_out, fan_avg |
| init_nonlinearity | str | relu | Nonlinearity for init calculations |

PEFT/LoRA Configuration

model:
  peft_config:
    peft_type: LORA
    task_type: CAUSAL_LM
    r: 16
    lora_alpha: 32
    lora_dropout: 0.1
    target_modules:
      - q_proj
      - v_proj

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| peft_type | str | required | PEFT type (currently only LORA) |
| task_type | str | required | Task type: CAUSAL_LM |
| r | int | required | LoRA rank |
| lora_alpha | float | required | LoRA alpha (scaling parameter) |
| lora_dropout | float | 0.0 | LoRA dropout |
| target_modules | list | required | List of module names to apply LoRA to |
| should_save_peft_only | bool | True | Save only PEFT weights |

Advanced Model Options

Block Overrides

For custom architectures and per-layer configurations:

model:
  block_overrides:
    order: ['encoder_block', 'decoder_block']  # Custom block order
    overrides:
      layer_0:
        d_model: 1024
        n_heads: 8

FC Layer Type

model:
  fc_type: torch  # or 'te' for TransformerEngine
  # Or as a dict:
  fc_type:
    name: torch
    kwargs:
      bias: true

Activation Checkpointing Target (MPT Models)

For fine-grained control over which modules to checkpoint:

# Option 1: Single module name
model:
  activation_checkpointing_target: grouped_query_attention

# Option 2: List of modules
model:
  activation_checkpointing_target: ['grouped_query_attention', 'mptmlp']

# Option 3: Dict with layer-specific targeting
model:
  activation_checkpointing_target:
    mptblock: 'last-16'                        # Checkpoint last 16 blocks
    grouped_query_attention: 'first-8, last-8' # First and last 8 layers
    mptmlp: 'middle-8'                         # Middle 8 layers
    norm: 'range-0-16'                         # Range of layers

Supported checkpoint targets:

  • mptblock - Entire transformer blocks
  • grouped_query_attention, multihead_attention, multiquery_attention - Attention layers
  • mptmlp, mptglu - FFN layers
  • norm_attn_norm - Fused norm-attention-norm blocks
  • te_ln_mlp - TransformerEngine MLP

Tokenizer Configuration

tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: 2048
    padding_side: left
    trust_remote_code: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Tokenizer name or path |
| kwargs | dict | {} | Additional tokenizer arguments |
| kwargs.model_max_length | int | None | Maximum sequence length |
| kwargs.padding_side | str | left | Padding side: left or right |
| kwargs.trust_remote_code | bool | False | Trust remote code |

Data Configuration

Training Data

train_loader:
  name: text
  dataset:
    local: ${variables.data_local}/train
    remote: ${variables.data_remote}/train
    split: train
    shuffle: true
    max_seq_len: ${variables.max_seq_len}
    shuffle_seed: ${variables.global_seed}
  drop_last: true
  num_workers: 8
  pin_memory: true
  prefetch_factor: 2
  persistent_workers: true

Evaluation Data

eval_loader:
  name: text
  dataset:
    local: ${variables.data_local}/val
    remote: ${variables.data_remote}/val
    split: val
    shuffle: false
    max_seq_len: ${variables.max_seq_len}
  drop_last: false
  num_workers: 8

# Multiple eval loaders
eval_loaders:
  - label: validation
    # ... loader config
  - label: test
    # ... loader config

Dataset Types

Text Dataset

For pretraining on streaming text data:

dataset:
  local: /path/to/data
  remote: s3://bucket/data
  split: train
  shuffle: true
  max_seq_len: 2048
  shuffle_seed: 42
  eos_token_id: 0

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| local | str | required | Local dataset path |
| remote | str | None | Remote dataset path |
| split | str | required | Dataset split |
| shuffle | bool | False | Shuffle dataset |
| max_seq_len | int | required | Maximum sequence length |
| shuffle_seed | int | 42 | Shuffle seed |
| eos_token_id | int | None | EOS token ID |

Finetuning Dataset

For instruction tuning and finetuning:

dataset:
  hf_name: tatsu-lab/alpaca
  split: train
  max_seq_len: 2048
  decoder_only_format: true
  allow_pad_trimming: false
  packing_ratio: 2.5
  shuffle: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| hf_name | str | required | HuggingFace dataset name |
| preprocessing_fn | str | None | Custom preprocessing function |
| safe_load | bool | False | Safe load dataset |
| max_seq_len | int | required | Maximum sequence length |
| decoder_only_format | bool | True | Use decoder-only format |
| allow_pad_trimming | bool | False | Allow trimming padded tokens |
| packing_ratio | float/str | None | Packing ratio (see Packing Ratio) |
| prompt_delimiter | str | None | Prompt delimiter |
| response_delimiter | str | None | Response delimiter |
| target_prompts | str | none | Target prompts: none, all, length |
| target_responses | str | last | Target responses: last, all |

Dataloader Options

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| drop_last | bool | True | Drop last incomplete batch |
| num_workers | int | 8 | Number of dataloader workers |
| pin_memory | bool | True | Pin memory for GPU transfer |
| prefetch_factor | int | 2 | Prefetch factor |
| persistent_workers | bool | True | Keep workers alive between epochs |
| timeout | int | 0 | Worker timeout in seconds |

Optimizer Configuration

optimizer:
  name: decoupled_adamw
  lr: 3.0e-4
  betas: [0.9, 0.999]
  eps: 1.0e-8
  weight_decay: 0.0

Available Optimizers

| Optimizer | Description |
| --- | --- |
| decoupled_adamw | AdamW with decoupled weight decay |
| decoupled_lionw | Lion optimizer with weight decay |
| adalr_lion | Lion with adaptive learning rate |
| clip_lion | Lion with gradient clipping |
| no_op | No-op optimizer (for eval) |
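
The other optimizers use the same block structure; for example, with illustrative hyperparameters:

optimizer:
  name: decoupled_lionw
  lr: 1.0e-4
  betas: [0.9, 0.95]
  weight_decay: 0.0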

Common Parameters

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| lr | float | required | Learning rate |
| betas | list[float] | [0.9, 0.999] | Beta parameters |
| eps | float | 1e-8 | Epsilon for numerical stability |
| weight_decay | float | 0.0 | Weight decay coefficient |

Scheduler Configuration

scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

Available Schedulers

| Scheduler | Description |
| --- | --- |
| cosine_with_warmup | Cosine annealing with warmup |
| linear_decay_with_warmup | Linear decay with warmup |
| constant_with_warmup | Constant LR with warmup |
| inv_sqrt_with_warmup | Inverse square root with warmup |

Common Parameters

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| t_warmup | str/int | required | Warmup duration (see Duration Formats) |
| alpha_f | float | 0.0 | Final learning rate multiplier (for cosine/linear schedulers) |

Inverse Square Root Scheduler Parameters

For name: inv_sqrt_with_warmup:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| t_scale | float | 1.0 | Time scaling factor |
| t_cooldown | str/int | 0 | Cooldown duration (see Duration Formats) |
| t_max | str/int | None | Maximum time (see Duration Formats) |
| alpha_f_decay | float | 1.0 | Final decay multiplier |
| alpha_f_cooldown | float | 0.0 | Final cooldown multiplier |
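
A sketch combining these with the common warmup parameter (values are illustrative):

scheduler:
  name: inv_sqrt_with_warmup
  t_warmup: 500ba
  t_scale: 1.0
  t_cooldown: 1000ba
  alpha_f_decay: 1.0
  alpha_f_cooldown: 0.0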

Training Configuration

# Duration and intervals
max_duration: 1ep              # or "1000ba", "100000tok"
eval_interval: 100ba
eval_first: false
eval_subset_num_batches: -1

# Batch sizes
global_train_batch_size: 256
device_eval_batch_size: 8
device_train_microbatch_size: auto

# Training settings
seed: 42
precision: amp_bf16
accumulate_train_batch_on_tokens: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| max_duration | str/int | required | Training duration (see Duration Formats) |
| eval_interval | str/int | 1 | Evaluation interval (see Duration Formats) |
| eval_first | bool | False | Evaluate before training |
| eval_subset_num_batches | int | -1 | Eval subset size (-1 for all) |
| global_train_batch_size | int | required | Global batch size |
| device_train_batch_size | int/float | auto | Per-device train batch size |
| device_eval_batch_size | int | required | Per-device eval batch size |
| device_train_microbatch_size | str/int/float | auto | Microbatch size (see Microbatch Size) |
| seed | int | required | Random seed |
| precision | str | amp_bf16 | Training precision |
| accumulate_train_batch_on_tokens | bool | False | Accumulate gradients by tokens |

Gradient Accumulation

Gradient accumulation is automatically configured based on batch sizes:

# Example: 8 GPUs, want effective batch size of 256
global_train_batch_size: 256        # Total batch size across all GPUs
device_train_microbatch_size: 8     # Microbatch per forward pass
# This results in:
# - device_train_batch_size = 256 / 8 = 32 per GPU
# - gradient_accumulation_steps = 32 / 8 = 4

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| global_train_batch_size | int | required | Total batch size across all devices |
| device_train_microbatch_size | str/int/float | auto | Microbatch size per forward pass |
| accumulate_train_batch_on_tokens | bool | False | Accumulate based on tokens (recommended for variable-length sequences) |

Note: When accumulate_train_batch_on_tokens: true, gradient accumulation ensures consistent token counts across accumulation steps, which is important for models trained on variable-length sequences.

Precision Options

| Value | Description |
| --- | --- |
| amp_bf16 | Automatic mixed precision with bfloat16 |
| amp_fp16 | Automatic mixed precision with float16 |
| bf16 | Pure bfloat16 |
| fp16 | Pure float16 |
| fp32 | Full precision |
| fp8 | FP8 precision (requires hardware support) |

Distributed Training

FSDP Configuration

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: DEFAULT
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  use_orig_params: true
  verbose: false

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| sharding_strategy | str | FULL_SHARD | Sharding strategy: FULL_SHARD, SHARD_GRAD_OP, NO_SHARD |
| mixed_precision | str | DEFAULT | Mixed precision: PURE, FULL, DEFAULT |
| activation_checkpointing | bool | False | Enable activation checkpointing (trade compute for memory) |
| activation_checkpointing_reentrant | bool | False | Use reentrant checkpointing (False recommended for better performance) |
| activation_cpu_offload | bool | False | Offload activations to CPU (further memory savings at a performance cost) |
| limit_all_gathers | bool | True | Limit all-gather operations |
| use_orig_params | bool | True | Use original parameters |
| state_dict_type | str | full | State dict type: full, sharded |
| forward_prefetch | bool | False | Enable forward prefetch |
| backward_prefetch | str | BACKWARD_PRE | Backward prefetch: BACKWARD_PRE, BACKWARD_POST |

Tensor Parallelism

tp_config:
  enabled: true
  tp_size: 2
  # Additional TP-specific configuration

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | False | Enable tensor parallelism |
| tp_size | int | 1 | Tensor parallel size |

Logging and Monitoring

Console Logging

progress_bar: true
log_to_console: true
python_log_level: info
console_log_interval: 10ba
log_config: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| progress_bar | bool | True | Show progress bar |
| log_to_console | bool | True | Log to console |
| python_log_level | str | info | Python log level |
| console_log_interval | str | 1ba | Console log interval (see Duration Formats) |
| log_config | bool | True | Log configuration at start |

Integration Loggers

Weights & Biases

loggers:
  wandb:
    project: my-project
    name: ${variables.run_name}
    group: experiment-1
    tags:
      - llm
      - training

MLflow

loggers:
  mlflow:
    tracking_uri: http://localhost:5000
    experiment_name: llm-training
    run_name: ${variables.run_name}

TensorBoard

loggers:
  tensorboard:
    log_dir: /logs/tensorboard
    flush_interval: 100

Callbacks

Standard Callbacks

callbacks:
  speed_monitor:
    window_size: 100
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
  optimizer_monitor: {}

Checkpointing Callbacks

callbacks:
  hf_checkpointer:
    save_folder: ${variables.save_folder}/hf_checkpoints
    save_interval: 1000ba
    precision: bfloat16
    overwrite: true
    
  mono_checkpoint_saver:
    save_folder: ${variables.save_folder}/checkpoints
    save_interval: 1000ba

Advanced Callbacks

| Callback | Description |
| --- | --- |
| kill_loss_spike | Kill training on loss spikes |
| nan_monitor | Monitor for NaN values |
| scheduled_gc | Scheduled garbage collection |
| layer_freezing | Progressive layer freezing |
| curriculum_learning | Curriculum learning |
| eval_output_logging | Log evaluation outputs |
| system_metrics_monitor | Monitor system metrics |
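
These are enabled the same way as the standard callbacks. The sketch below assumes default arguments; each callback accepts its own keyword options, which are omitted here (consult the repository for the available options):

callbacks:
  nan_monitor: {}
  scheduled_gc: {}
  eval_output_logging: {}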

Checkpointing

# Saving configuration
save_folder: s3://bucket/checkpoints
save_interval: 1000ba
save_num_checkpoints_to_keep: 3
save_overwrite: false
save_weights_only: false

# Loading configuration
load_path: s3://bucket/checkpoints/latest
load_weights_only: false
load_strict_model_weights: true
load_ignore_keys:
  - optimizer
  - schedulers

# Auto-resumption
autoresume: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| save_folder | str | None | Checkpoint save location |
| save_interval | str/int | 1000ba | Save interval (see Duration Formats) |
| save_num_checkpoints_to_keep | int | -1 | Number of checkpoints to keep |
| save_overwrite | bool | False | Overwrite existing checkpoints |
| save_weights_only | bool | False | Save only model weights |
| save_filename | str | ep{epoch}-ba{batch}-rank{rank}.pt | Checkpoint filename template |
| save_latest_filename | str | latest-rank{rank}.pt | Latest checkpoint filename |
| load_path | str | None | Path to load checkpoint |
| load_weights_only | bool | False | Load only weights |
| load_strict_model_weights | bool | True | Strict weight loading |
| autoresume | bool | False | Auto-resume from latest |
| save_planner | dict | None | FSDP save planner configuration |
| load_planner | dict | None | FSDP load planner configuration |

Evaluation

ICL Tasks

icl_tasks:
  - label: lambada
    dataset_uri: eval/local_data/lambada.jsonl
    num_fewshot: [0, 1, 5]
    icl_task_type: language_modeling
    continuation_delimiter: " "

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| label | str | required | Task label |
| dataset_uri | str | required | Dataset location |
| num_fewshot | list[int] | [0] | Few-shot examples |
| icl_task_type | str | required | Task type |
| continuation_delimiter | str | " " | Continuation delimiter |

Eval Gauntlet

eval_gauntlet:
  weighting: uniform
  subtract_random_baseline: true
  rescale_accuracy: true
  categories:
    - name: commonsense_reasoning
      benchmarks:
        - name: hellaswag
          num_fewshot: 10
        - name: winogrande
          num_fewshot: 5

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| weighting | str | uniform | Weighting scheme |
| subtract_random_baseline | bool | True | Subtract random baseline |
| rescale_accuracy | bool | True | Rescale accuracy scores |

System Configuration

# Memory settings
max_split_size_mb: 512
expandable_segments: true
cuda_load_lazy: false

# Distributed settings
dist_timeout: 600

# Code paths
code_paths:
  - /path/to/custom/code

# Compilation
compile_config:
  mode: max-autotune
  fullgraph: true

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| max_split_size_mb | int | 512 | CUDA memory allocator setting |
| expandable_segments | bool | True | CUDA expandable segments |
| cuda_load_lazy | bool | False | Lazy CUDA loading |
| dist_timeout | float | 600.0 | Distributed timeout (seconds) |
| code_paths | list | [] | Additional code paths |
| compile_config | dict | {} | Torch compile configuration |

Example Configurations

Minimal Pretraining Config

variables:
  max_seq_len: 2048
  global_seed: 42
  data_remote: s3://my-bucket/data

model:
  name: mpt_causal_lm
  d_model: 2048
  n_heads: 16
  n_layers: 24
  expansion_ratio: 4
  max_seq_len: ${variables.max_seq_len}
  vocab_size: 50432

tokenizer:
  name: EleutherAI/gpt-neox-20b

train_loader:
  name: text
  dataset:
    remote: ${variables.data_remote}/train
    split: train
    max_seq_len: ${variables.max_seq_len}
  drop_last: true
  num_workers: 8

optimizer:
  name: decoupled_adamw
  lr: 3e-4

scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba

max_duration: 10000ba
global_train_batch_size: 256
device_eval_batch_size: 16
seed: ${variables.global_seed}

Fine-tuning Config

model:
  name: hf_causal_lm
  pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
  use_flash_attention_2: true
  
  # LoRA configuration
  peft_config:
    peft_type: LORA
    task_type: CAUSAL_LM
    r: 16
    lora_alpha: 32
    lora_dropout: 0.1
    target_modules:
      - q_proj
      - v_proj
      - k_proj
      - o_proj

train_loader:
  name: finetuning
  dataset:
    hf_name: tatsu-lab/alpaca
    split: train
    max_seq_len: 2048
    decoder_only_format: true
    packing_ratio: auto

optimizer:
  name: decoupled_adamw
  lr: 1e-4

max_duration: 3ep

Memory-Optimized Training Config

For training large models with limited GPU memory:

# 30B model on 8x A100 40GB
global_train_batch_size: 128
device_train_microbatch_size: 2  # Small microbatch for memory
accumulate_train_batch_on_tokens: true

fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false  # Set true if still OOM
  limit_all_gathers: true

model:
  # Fine-grained checkpointing for 30B model
  activation_checkpointing_target:
    mptblock: 'last-24'  # Checkpoint last 24 of 32 blocks
    grouped_query_attention: 'all'  # Checkpoint all attention

Maximum Memory Savings Config

For extreme memory constraints:

# 70B model training
global_train_batch_size: 16
device_train_microbatch_size: 1
accumulate_train_batch_on_tokens: true

fsdp_config:
  sharding_strategy: FULL_SHARD
  state_dict_type: sharded  # Save memory during checkpointing
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true  # Maximum memory savings
  limit_all_gathers: true
  use_orig_params: true

model:
  activation_checkpointing_target: 'all'  # Checkpoint everything

This configuration reference covers all major options available in LLM Foundry. For the most up-to-date information, refer to the source code and example configurations in the repository.
