wassname/hf_datasets_cheatsheet.md

Last active March 21, 2025 06:57


huggingface datasets cheatsheet for text functional operations


what functional operations are there?

Dataset methods:
- columns: remove_columns, rename_columns, select_columns
- map
  - reduce: in batched mode, map can return a different number of rows than it received, or be run purely for its side effects, so it can double as a reduce
- filter
- select: this is how you slice
  - take: like head
- shuffle
- sort
- train_test_split: like sklearn
- search
  - search_batch
  - get_nearest_examples
  - get_nearest_examples_batch
Module-level functions (imported from datasets):
- concatenate_datasets
- interleave_datasets

how to join two datasets?

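from datasets import concatenate_datasets

# axis=1 joins column-wise (side by side); a and b must have the same number of rows.
# The renames keep column names from colliding.
c = concatenate_datasets(
    [
        a.rename_columns({k: f'old_{k}' for k in a.column_names}),
        b.rename_columns({k: f'new_{k}' for k in b.column_names}),
    ],
    axis=1,
)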

how to reduce?

acc = {"max": 0}

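def max_(col):
  # col is one batch of values from the "text" column.
  # Nothing is returned, so the dataset itself is left unchanged.
  acc["max"] = max([acc["max"]] + [len(text) for text in col])

# This won't work with multiprocessing (num_proc > 1), because each worker process gets its own copy of acc. For that, see the linked Colab: https://colab.research.google.com/drive/1jCLv31Y4cDfqD0lhO0AnqEv3Or-LLvWe?usp=sharing#scrollTo=dOB57NHroLxx
ds.map(max_, input_columns="text", batched=True)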

acc["max"]
