
@wassname
Last active March 21, 2025 06:57
huggingface datasets cheatsheet for text functional operations

what functional operations are there?

  • Dataset methods:

    • columns: remove_columns, rename_columns, select_columns
    • map
      • reduce: in batched mode the function can return a different number of rows than it received, or have side effects, letting us use map as a reduce
    • filter
    • select: this is how you slice
      • take: like head
    • shuffle
    • sort
    • train_test_split: like sklearn
    • search: needs an index first (add_faiss_index or add_elasticsearch_index)
      • search_batch
      • get_nearest_examples
      • get_nearest_examples_batch
  • module-level functions (they take several datasets):

    • concatenate_datasets
    • interleave_datasets

how to join two datasets?

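from datasets import concatenate_datasets
# axis=1 joins column-wise (side by side); a and b must have the same number of rows.
# The default axis=0 stacks rows and needs matching columns, so it would fail after the renames.
c = concatenate_datasets([
    a.rename_columns({k:f'old_{k}' for k in a.column_names}),
    b.rename_columns({k:f'new_{k}' for k in b.column_names}),
], axis=1)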

how to reduce?

acc = {"max": 0}

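def max_(col):
    # col is one batch of the "text" column (a list of strings).
    # Nothing is returned, so the dataset is left unchanged; only the side effect matters.
    acc["max"] = max([acc["max"]] + [len(text) for text in col])

# this won't work with multiprocessing (num_proc > 1), because each worker process gets its own copy of acc; for that see the linked colab https://colab.research.google.com/drive/1jCLv31Y4cDfqD0lhO0AnqEv3Or-LLvWe?usp=sharing#scrollTo=dOB57NHroLxx
ds.map(max_, input_columns="text", batched=True)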

acc["max"]