This repository hosts ZIP-FIT, a data selection framework that picks relevant training data for language models from any data source, guided by a specified target dataset.
ZIP-FIT is optimized for:
- Rapid, large-scale data selection from extensive raw text datasets.
- Identifying data that closely aligns with the distribution of a given target dataset (e.g., domain-specific data or HumanEval).
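
To make the idea of target-aligned selection concrete, here is a minimal sketch. It assumes alignment is scored with a gzip-based normalized compression distance (NCD), which the text above does not state. The function names `compressed_size`, `ncd`, and `select_top_k` are hypothetical and are not this repository's API.

```python
import gzip
from typing import List


def compressed_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more shared structure."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def select_top_k(candidates: List[str], targets: List[str], k: int) -> List[str]:
    """Return the k candidates with the highest mean alignment to the targets.

    Alignment is taken to be 1 - NCD (an assumption for illustration),
    averaged over all target examples.
    """
    def alignment(candidate: str) -> float:
        return 1.0 - sum(ncd(candidate, t) for t in targets) / len(targets)

    return sorted(candidates, key=alignment, reverse=True)[:k]


if __name__ == "__main__":
    targets = ["def add(x, y):\n    return x + y\n"]
    pool = [
        "def add(x, y):\n    return x + y\n",
        "The weather in Paris was mild and rainy last spring.",
    ]
    print(select_top_k(pool, targets, k=1))
```

This sketch compares every candidate against every target, so its cost grows with the product of the two dataset sizes; a large-scale run would likely need batching or parallelism.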