@bradhilton · Created February 25, 2025 21:41

Reinforcement Learning Report

On Monday, February 10th, we started with the goal of training a SOTA model with reinforcement learning in two weeks' time. Now, a little over two weeks later, we unfortunately have not achieved that goal. However, we may have trained a SOTA open-source model on a niche but challenging logical task. Here I hope to document some of the novel challenges and solutions I found in the process.

Getting Started

I started by building some modest infrastructure to make benchmarking, collecting distillation data, and generating samples easier, while maintaining flexibility. I also wanted to be able to easily observe and monitor inference progress, so I pulled in and enhanced some utility code I had previously written for consuming a chat completion stream: it monitors chunks and builds up a standard OpenAI ChatCompletion object. Using the on_chunk callback, we are able to both observe token consumption in real time and stream the results to log files for local inspection. The relevant code can be found in /experiments/lib/stream.py.
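
To make the shape of this concrete, here is a minimal sketch of that kind of stream consumer, assuming a hypothetical on_chunk signature; the real code in /experiments/lib/stream.py builds a full ChatCompletion object and may differ in details:

from typing import Callable, Optional

from openai import AsyncOpenAI
from openai.types.chat import ChatCompletionChunk

# Hypothetical callback signature; the actual on_chunk in stream.py may differ.
OnChunk = Callable[[ChatCompletionChunk], None]

async def consume_chat_completion_stream(
    client: AsyncOpenAI,
    on_chunk: Optional[OnChunk] = None,
    **create_params,
) -> str:
    """Stream a chat completion, invoking on_chunk for every chunk, and
    return the assembled assistant message content."""
    content: list[str] = []
    stream = await client.chat.completions.create(stream=True, **create_params)
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. count tokens in real time or append to a log file
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            content.append(delta)
    return "".join(content)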

Second, I wanted to be able to cache responses, both to save dollar spend/compute time and to avoid re-running redundant benchmarks. I explored various backends for caching, including Redis, and ultimately landed on using the https://developmentseed.org/obstore library to persist requests to disk and to any valid object store (in my case, a Google Cloud Storage bucket). Unique cache keys are created from both client params (such as the base URL) and chat completion creation params, and saved to a requests jsonl file. The get_chat_completion utility function in the chat_completions.py library file caches requests by default; I disabled the feature for actual training, but it was handy for benchmarking and collecting distillation data.
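
Roughly, the cache key derivation works like the sketch below; the hashing scheme and plain-file persistence here are illustrative assumptions (the actual implementation writes through obstore to disk or a GCS bucket):

import hashlib
import json
from pathlib import Path

def cache_key(client_params: dict, creation_params: dict) -> str:
    # Deterministic key derived from both client params (e.g. base_url)
    # and chat completion creation params (model, messages, temperature, ...).
    payload = json.dumps(
        {"client": client_params, "creation": creation_params},
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def append_cached_response(path: Path, key: str, completion: dict) -> None:
    # One JSON object per line in the requests jsonl file, keyed by the hash.
    with path.open("a") as f:
        f.write(json.dumps({"key": key, "completion": completion}) + "\n")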

Third, I wanted to create a simple, generic interface for defining the reinforcement learning problem. I settled on a Task data type inspired by OpenPipe's data types that simply has two properties: messages and grader. The main difference is that the grader is a bit more flexible, allowing asynchronous returns and optional metric passing. The main utility function is get_task_results, which, given a list of tasks and an AsyncOpenAI client instance, will asynchronously return the corresponding task results with a script- and notebook-friendly progress bar. It utilizes the previously mentioned on_chunk callback to monitor token logprob counts in real time and optionally calculate estimated dollar spend, for example when collecting distillation data or benchmarking. Both get_task_results and get_chat_completion have a lot of options, some of which can probably be removed now. Task-related code may be found in tasks.py. With these utilities in hand, we can define our tasks.
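
As a sketch, the interface looks something like the following; the exact field names and grader return types are assumptions based on the description above, not the actual tasks.py definitions:

from dataclasses import dataclass
from typing import Awaitable, Callable, Union

from openai.types.chat import ChatCompletion, ChatCompletionMessageParam

# A grader maps a completed chat to a reward, optionally with extra metrics,
# and may return either synchronously or asynchronously.
Grade = Union[float, tuple[float, dict[str, float]]]
Grader = Callable[[ChatCompletion], Union[Grade, Awaitable[Grade]]]

@dataclass
class Task:
    messages: list[ChatCompletionMessageParam]
    grader: Grader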

Collecting training data was actually one of the first, if not the first, things I did. I collected game data for NYT Connections from multiple sources, but ultimately opted to just use @Kyle Corbitt's collection, which I think had the most examples. I also collected data for the Allen Institute's Zebra Grid logic benchmark and ARC-AGI, and included data from my own benchmark, temporal-clue. Each task has a corresponding file in the library that provides utility function(s) to get its respective tasks.

To benchmark the tasks, I mostly relied on a couple of providers, starting with OpenRouter because of its extensive coverage, but ultimately signing up for Fireworks to get reliable (but pricey) DeepSeek R1 completions. Unfortunately, OpenRouter has not been terribly reliable and returns many unhelpful error messages, but it could be that I'm simply holding it wrong and that further debugging would easily fix the issues. My feeling now, however, is that it is easier to go directly to providers like Fireworks, even if more time consuming. I also ran into some quota restrictions with OpenAI while benchmarking o3-mini, so I may need to increase my spend to bump my service tier.

Caveats aside, I found the NYT Connections benchmark very challenging. At the start, my focus was on DeepSeek R1 distilled reasoning models, but I found that the models, especially the smallest ones, had unimpressive performance. I think the Connections task and general knowledge tasks are difficult to hill climb with smaller models because they depend heavily on world knowledge, which correlates directly with parameter count. Even the best open-source model, DeepSeek R1, lagged noticeably behind o-series models, making SOTA prospects dim.

I briefly tested ARC-AGI, but honestly wasn't able to give it much attention. I do think it's going to be a fairly hard benchmark, especially beating o3's performance, which apparently cost north of $1MM in compute. I believe Zebra Grid is much more doable, but in my few experiments I haven't been able to break into the 70% range. If we want to beat a recognized benchmark, this is probably our best bet, though it would require a lot more experiments and probably more compute than the Temporal Clue benchmark.

Ultimately I leaned into what I was familiar with and focused on my Temporal Clue dataset. This is unfortunately a completely unknown task; though technically available in the depths of a public open-source repo, I have made no efforts to publicize it. Inspired by Clue with additional temporal (time-based) complexity, the puzzles are generated with some code, and a CP-SAT solver is used to find a minimal set of clues that allow you to solve the puzzle. While I cannot absolutely rule out bugs, the solver is pretty strict and will throw scary errors at you if you get the constraints wrong. If anything, the constraints may be underspecified, which means there may be some redundant clues; that shouldn't be a problem. The various types of clues test associational, spatial, temporal, and logical reasoning. The puzzle prompt, including a set of foundational and bonus questions, as well as the solution, are included in each example. Part of the task is format adherence, but I think the format is pretty straightforward for most LLMs to follow. An example puzzle may be seen here: https://grok.com/share/bGVnYWN5_ec070956-7272-4ed4-b15d-d15401544b3c.
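
To make the clue-minimization idea concrete, here is a simplified OR-Tools sketch of the uniqueness check it relies on; this is an illustrative stand-in, not the actual temporal-clue generator:

from ortools.sat.python import cp_model

def has_unique_solution(build_model) -> bool:
    # build_model() should return a CpModel encoding the puzzle variables
    # plus the constraints implied by a candidate clue set.
    model = build_model()

    class SolutionCounter(cp_model.CpSolverSolutionCallback):
        def __init__(self) -> None:
            super().__init__()
            self.count = 0

        def on_solution_callback(self) -> None:
            self.count += 1
            if self.count > 1:  # more than one solution: the clues are insufficient
                self.StopSearch()

    solver = cp_model.CpSolver()
    solver.parameters.enumerate_all_solutions = True
    counter = SolutionCounter()
    solver.Solve(model, counter)
    return counter.count == 1

A greedy minimization pass can then try dropping each clue in turn and keep it only if removing it breaks uniqueness.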

Training Recipe

Out of expediency, I admittedly once again leaned into the familiar as I created the training recipe. For inference, I used vLLM. I would be interested to try SGLang, which may be more performant, though it may be missing critical features. vLLM is fairly robust and is approaching a v1 release this year. It includes a lot of arguments that require careful tuning to maximize throughput. I'm not sure I've found the perfect settings, though I've tried many permutations. Here are the arguments I'm currently using:

named_arguments=dict(
    # Stuck with this from previous experimentation
    block_size=32,
    # Reduce log noise
    disable_log_requests=True,
    # This is very important, especially when generating lots of samples for the same tasks
    enable_prefix_caching=True,
    # While CUDA graph can be faster, it slows down startup time. Could be worth experimenting some more.
    enforce_eager=True,
    # Found I could bump this up from the standard 90% without OOM errors
    gpu_memory_utilization=0.95,
    # May still be too short for reasoning models, but I've not tested
    max_model_len=16384,
    # Just choose a large number, actually throttle the number of requests on the client side with a semaphore
    max_num_seqs=1024,
    # I believe having this as the same as max_model_len is best for performance
    max_num_batched_tokens=16384,
    # Bumping scheduler steps up helps keep the CPU from being a bottleneck
    num_scheduler_steps=8,
    # I try to schedule requests so that preemption is not needed, but when it does happen, I prefer swap
    preemption_mode="swap",
    # This lets us get token ids back from the chat completion API
    return_tokens_as_token_ids=True,
    # 80GB has been sufficient
    swap_space=80,
    # Set the tensor parallel size to the number of GPUs
    tensor_parallel_size=torch.cuda.device_count(),
)
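
For reference, a dict like this can be mapped onto CLI flags for vLLM's OpenAI-compatible server; the launcher below is an illustrative sketch rather than the project's actual startup code:

import subprocess

def to_cli_flags(named_arguments: dict) -> list[str]:
    # Convert snake_case keyword arguments into --kebab-case vLLM CLI flags.
    flags: list[str] = []
    for key, value in named_arguments.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                flags.append(flag)  # boolean options are passed as bare flags
        else:
            flags.extend([flag, str(value)])
    return flags

def start_vllm_server(model: str, named_arguments: dict) -> subprocess.Popen:
    return subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
        + to_cli_flags(named_arguments)
    )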

Noticeably absent is chunked prefill. I experimented with it, but it didn't seem to help. If you have any suggestions, let me know. I think there are a lot of possible performance improvements on the inference side, particularly avoiding startup time, though that will take considerable engineering effort. Some lower-hanging fruit may be improving request scheduling. I've begun laying the groundwork for more sophisticated scheduling to maximize KV cache utilization. Also, the Kimi paper shows that allowing for some off-policy data can improve throughput by limiting the maximum number of tokens to sample each iteration and saving unfinished responses for the next iteration. It may hurt learning performance slightly, but the faster iteration time would probably make up for it.
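
That Kimi-style partial rollout idea could look roughly like this; everything below is a hypothetical sketch, not something currently implemented in the repo:

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PartialRollout:
    prompt_token_ids: list[int]
    generated_token_ids: list[int] = field(default_factory=list)
    finished: bool = False

def continue_rollouts(
    rollouts: list[PartialRollout],
    sample_tokens: Callable[[PartialRollout, int], tuple[list[int], bool]],
    max_new_tokens_per_iteration: int,
) -> tuple[list[PartialRollout], list[PartialRollout]]:
    # Spend at most a fixed token budget per rollout this iteration; unfinished
    # rollouts are carried over (now slightly off-policy) to the next iteration.
    finished, carried_over = [], []
    for rollout in rollouts:
        new_ids, done = sample_tokens(rollout, max_new_tokens_per_iteration)
        rollout.generated_token_ids.extend(new_ids)
        rollout.finished = done
        (finished if done else carried_over).append(rollout)
    return finished, carried_over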

For the tuning stage, I again went with what I was familiar with: torchtune. It's a very flexible post-training library by the PyTorch team that provides many training utilities and performance optimizations, as well as ready-to-go recipes that you can copy and modify for your needs. As a base, I used a recipe I had previously modified to support both single- and multi-device training, as well as loading and swapping reference model weights and training an optional MLP value head. It includes support for the following torchtune features:

  • Fully Sharded Data Parallel (FSDP) training
  • FSDP CPU "offloading of parameters, gradients, and optimizer states."
  • Activation Checkpointing
  • Activation Offloading
  • FP32 and BF16 Precisions
  • Gradient Accumulation
  • Transformer Output Chunking

The recipe can be found in recipe.py. I also want to add tensor parallelism support, which is available on torchtune's main branch but not yet part of an official release (as far as I am aware). They are also adding support for multinode training if we want to train with more than 8 GPUs. LoRA/QLoRA support is available too, but I have not added it, as I felt it probably wouldn't help if our main objective is to achieve SOTA. However, as I've run experiments it's become apparent that scaling the number of iterations is probably the most important factor, so as a practical matter, memory optimizations might help us achieve better performance by letting us scale iterations/data faster.

The data preparation pipeline has a few components. The previously mentioned get_task_results method allows passing a result transform function. For training, we pass a callable instance of the TaskResultTokenizer class. It uses a HuggingFace tokenizer to create TokenizedResult instances with a number of helpful fields, including string tokens, integer token ids, an assistant token mask, input positions, and token logprobs. The assistant token mask requires the chat template string to have {% generation %}/{% endgeneration %} tags wrapping the assistant's generated content, so we modify the default chat templates to include these generation tags. The assistant token mask is used to select only valid token locations in the GRPO loss function.
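
For illustration, here is how an assistant mask can be obtained once the chat template includes the generation tags, using transformers' apply_chat_template; the model name is just a placeholder, and this is not the TaskResultTokenizer code itself:

from transformers import AutoTokenizer

# Placeholder model; assumes its chat template has been modified to wrap
# assistant content in {% generation %} ... {% endgeneration %} tags.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

messages = [
    {"role": "user", "content": "Who was in the kitchen at 9 PM?"},
    {"role": "assistant", "content": "Colonel Mustard."},
]

encoded = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
token_ids = encoded["input_ids"]
# 1 for tokens generated by the assistant, 0 elsewhere; used to select valid
# token locations in the GRPO loss.
assistant_mask = encoded["assistant_masks"]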

After tokenizing the task results, we pack the results into a tensor dictionary. The packed_tensors_from_tokenized_results method compactly packs the results into a set of tensors necessary for the tuning step. To pack multiple prompt/response pairs into the same sequences, we assign each token a group id to be used when creating the causal mask. Additionally, to avoid repeating the same prompt when there are multiple child responses in the same sequence, we also assign parent ids that are likewise used when creating the causal mask. This allows us to efficiently pack tokens into the training data without too much wasted space for padding. Currently we pack the sequences greedily in order; there may be ways to improve packing efficiency further with a more sophisticated strategy.
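
The group/parent id trick can be illustrated with a small mask-construction sketch; the tensor names here are assumptions rather than the actual packed_tensors_from_tokenized_results output:

import torch

def packed_causal_mask(group_ids: torch.Tensor, parent_ids: torch.Tensor) -> torch.Tensor:
    # Boolean attention mask for one packed sequence: query token i may attend
    # to key token j iff j is not in the future and j's group is either i's own
    # group or i's parent group (the shared prompt).
    seq_len = group_ids.size(0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_group = group_ids.unsqueeze(1) == group_ids.unsqueeze(0)
    parent_group = parent_ids.unsqueeze(1) == group_ids.unsqueeze(0)
    return causal & (same_group | parent_group)

# Example: one prompt (group 0) shared by two sampled responses (groups 1 and 2).
group_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2])
parent_ids = torch.tensor([0, 0, 0, 0, 0, 0, 0])
mask = packed_causal_mask(group_ids, parent_ids)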

Packed sequences may be visualized with the plot_packed_tensors utility function. Here is an example visualization of the packed tensor data:

[Figure: example plot_packed_tensors visualization of the packed tensor data]

After packing, the torchtune recipe may be executed with the tune method. It creates a config file and runs the tune step in its own subprocess. Upon completion, the last epoch's weights are renamed so they can be used to start a new vLLM server and then further tuned after additional training data has been generated.
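
In outline, the tune step amounts to something like the following sketch; the actual config fields, recipe path, and torchtune invocation in the repo will differ:

import subprocess

import yaml

def tune(config: dict, recipe_path: str = "recipe.py", config_path: str = "config.yaml") -> None:
    # Write the generated torchtune config and run the custom recipe in its
    # own subprocess so training failures surface as exceptions.
    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)
    subprocess.run(["tune", "run", recipe_path, "--config", config_path], check=True)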

Experiments

With this general training recipe, I ran many experiments. Permutations included different models, learning rates, batch sizes, and tasks (e.g. temporal-clue, zebra-grid) on different nodes with up to 8xH100s. I also conducted some research experiments, most notably with a soft alternative to standard PPO clipping (which doesn't seem to help) and with "surprise" reward shaping to encourage diverse responses, which did increase diversity but not performance on the actual task. Here is a summary of some of the experiments:

  • 010-reinforce: Initial experiment without hard or soft clipping.
  • 012-grpo through 016-grpo: Initial GRPO experiments on the nyt-connections task with the distilled reasoning models. Performance was poor.
  • 018-eight-h100s-test: Initial Llama 70B on 8xH100 tests.
  • 022-rl123-repro: A successful reproduction of a previous experiment on the temporal-clue task.
  • 023-distilled-qwen-7b: An unsuccessful attempt to train the distilled reasoning Qwen 7B model on the temporal-clue task.
  • 024-clipping and 025-clipping: Hard clipping experiments.
  • 026-dynamic-learning-rate: An experiment to change the learning rate based on the number of training sequences. Required modification.
  • 027-entropy-bonus: Experimentation with an entropy bonus. I don't think it made much of a difference, and too much entropy can lead to instability. Still worth exploring.
  • 028-zebra-grid: An experiment with the zebra-grid task.
  • 029-distilled-qwen-7b: Another distilled Qwen 7B experiment. Quickly aborted for poor performance.
  • 030-qwen-14b: A short Qwen-14B experiment. Not sure if this was a distilled or standard model.
  • 031-zebra-grid-qwen-14b: A Zebra Grid Qwen-14B experiment. Went fairly well, but performance seemed to hit a ceiling after 40 steps.
  • 034-surprise-bonus through 037-surprise-bonus: While the surprise bonus successfully encouraged diverse responses, it mostly reward hacked the grader by generating long, nonsensical responses. However, it did spur the creation of the InferenceEarlyStop utility class.
  • 038-tune-performance and 039-tune-performance: Some good experiments that started as performance-focused, but ultimately 039 has been one of the most successful experiments.
  • 040-qwen-32b: Similar to 039, but with a larger model. Disappointing results; the batch size or learning rate may not have been tuned right, and it would probably benefit from hard clipping, as you will see.
  • 041-llama-70b: Another similar run, this one with the biggest model possible and allowed to run the longest. It has possibly achieved open-source SOTA performance on the temporal-clue benchmark.
  • 042-nyt-connections-qwen-14b: Another attempt at the nyt-connections task, but ultimately struggled and was aborted after 8 steps.
  • 043-zebra-grid-qwen-32b: A new attempt at the zebra-grid task; results peaked at about the same level of performance as 031, but already by step #8.
  • 044-clipping: Another experiment with hard clipping that went very well. Apparently needed a larger learning rate. Probably the best performing run when considering base model size. Doing the vast majority of experiments with soft clipping may have been a big mistake. 😬
  • 045-qwen-72b: A quick attempt at using a big model other than Llama 70B, but it did not go well.
  • 046-distilled-qwen-32b: Another attempt at using a reasoning model, this time a larger one, but was quickly aborted.
  • 047-clipping: Another hard clipping experiment with an even larger learning rate, which appears to have been too large.
  • 048-implicit-rewards and 049-implicit-rewards: Two experiments to test a novel research idea, incorporating implicit rewards inspired by PRIME. More experiments are needed to determine the efficacy, probably with a higher weight on the implicit-reward-weighted data.

Given more time and compute, there are many more experiments I would like to run, especially with hard clipping and with implicit rewards. Those, along with performance optimizations, are the most important next directions in my view.

Conclusion

Unfortunately, I was not able to train an objectively SOTA model on a well-known benchmark. I do think that the recipe is scalable and that SOTA results could be achieved. Many published results are from training runs with hundreds or even thousands of iterations (DeepSeek R1 was trained for ~8000 steps). To increase the probability of success we should consider focusing on larger models and perhaps multinode training. However, the lowest-hanging and best next steps in my view are performance optimizations: particularly adding support for tensor parallelism, maybe quantization, reducing/eliminating vLLM startup time, increasing vLLM KV cache utilization, and experimenting with off-policy data, Kimi-style. I also believe that implicit rewards could potentially improve performance if we can match PRIME's 2.5x sample efficiency. Additionally, the hyperparameter space is very large; we are probably at a local maximum and could benefit from additional exploration.
