#!/usr/bin/env python3
"""
Human quality transcripts from audio files using
AssemblyAI for transcription and Google's Gemini for enhancement.
Requirements:
- AssemblyAI API key (https://www.assemblyai.com/)
- Google API key (https://aistudio.google.com/)
- Python packages: assemblyai, google-generativeai, pydub
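The body of the script isn't captured above, so here is a minimal sketch of the two-step flow the header describes, assuming the standard assemblyai and google-generativeai client APIs (the Gemini model name and cleanup prompt are illustrative, and the pydub audio-chunking step is omitted):

```python
import assemblyai as aai
import google.generativeai as genai

aai.settings.api_key = "ASSEMBLYAI_API_KEY"  # placeholder
genai.configure(api_key="GOOGLE_API_KEY")    # placeholder

# Step 1: get a raw transcript from AssemblyAI.
transcript = aai.Transcriber().transcribe("episode.mp3")

# Step 2: ask Gemini to polish it to human quality (prompt is illustrative).
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Clean up this raw transcript: fix punctuation, casing, and obvious "
    "mis-transcriptions without changing the meaning.\n\n" + transcript.text
)
print(response.text)
```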
@natolambert
natolambert / skyworks-rewardbench-contamination.md
Last active February 18, 2025 03:56
MagPie RewardBench Contamination (found through SkyWorks Preferences)

Recently, I learned that some of the top reward models on RewardBench were trained on a preference dataset that is unintentionally contaminated with the benchmark. The dataset, Skyworks Preferences 80k, picked up the contamination by mixing in a Magpie dataset. Magpie is a new method for having language models generate instructions by prompting them with an empty chat template. The contaminated source mixed into the Skyworks dataset is Argilla/magpie-ultra-v0.1, generated with Llama 3.1 405B Instruct. I would never have expected a Magpie dataset to be contaminated.
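To make the method concrete, here is a minimal sketch of Magpie-style generation, assuming a Llama-3-style chat template via transformers (the model name, template string, and sampling settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed; any Llama 3.1 Instruct model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The "empty" chat template: everything up to where a user message would start.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False)

# The model completes the template with a plausible user instruction,
# which becomes a synthetic prompt for training data.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
instruction = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(instruction)
```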

What seems likely is that Meta trained on some of these prompts, but the exact provenance of each prompt needs more examination. For example, we learned that some of the prompts we used in our LLMBar subsets were taken from popular training sets like Al

from typing import Dict, List

from rich.console import Console
from rich.panel import Panel
from datasets import load_dataset


def print_hf_messages(messages: List[Dict[str, str]]):
    console = Console()
    colors = ["red", "green"]
    color_idx = 0
    console.rule(f"[bold yellow]The number of turns is {len(messages)}")
    # Assumed completion (the excerpt cuts off here): print each turn in a
    # panel, alternating border colors between roles.
    for message in messages:
        console.print(Panel(message["content"], title=message["role"], border_style=colors[color_idx]))
        color_idx = (color_idx + 1) % len(colors)
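A hypothetical usage example; the snippet imports load_dataset but the call isn't captured in the excerpt, and the dataset id and column name here are assumptions:

```python
# Assumed: the Skywork preference dataset id and its "chosen" message column.
dataset = load_dataset("Skywork/Skywork-Reward-Preference-80K-v0.1", split="train")
print_hf_messages(dataset[0]["chosen"])
```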
@hamelsmu
hamelsmu / is_fine_tuning_valuable.md
Last active April 4, 2024 01:22
My thoughts re: Is fine tuning still valuable?

Here is my personal opinion about the questions I posed in this tweet:


I think that fine-tuning is still very valuable in many situations. I’ve done some more digging, and I find that the people who say fine-tuning isn’t useful are indeed often working on products where fine-tuning isn’t likely to be useful:

  • They are making developer tools, where foundation models have already been trained extensively on coding tasks.
  • They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for those most general cases.
  • They are building a personal assistant that isn’t scoped to any particular domain or use case, which is essentially the same problem the folks building foundation models are working on.