- Dataset methods:
    - columns (see the sketch after this list):
        - remove_columns
        - rename_columns
        - select_columns
    - map
    - reduce: there is no reduce method, but map in batched mode can change the batch size or have side effects, which lets us use it as a reduce
    - filter
    - select: this is how you slice, by passing a list or range of indices (examples after this list)
    - take: like head
    - shuffle
    - sort
    - train_test_split: like sklearn
    - search
    - search_batch
    - get_nearest_examples
    - get_nearest_examples_batch: these four need an index first, e.g. add_faiss_index (see the sketch after this list)
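A minimal sketch of the column methods, assuming a dataset ds with columns "text" and "label" (placeholder names):

ds2 = ds.rename_columns({"text": "sentence"})   # old name -> new name
ds2 = ds2.remove_columns(["label"])             # drop columns by name
ds2 = ds2.select_columns(["sentence"])          # keep only these columns
ds2.column_names                                # ['sentence']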
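And a sketch of the slicing/reordering methods on the same placeholder ds:

first_100 = ds.select(range(100))                      # slice by indices
head = ds.take(5)                                      # like head
shuffled = ds.shuffle(seed=42)
by_label = ds.sort("label")
short = ds.filter(lambda ex: len(ex["text"]) < 200)
splits = ds.train_test_split(test_size=0.1, seed=42)   # DatasetDict with "train" and "test"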
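The search / get_nearest_examples methods only work after adding an index; a hedged sketch using add_faiss_index, assuming faiss is installed and ds has an "embeddings" column of float32 vectors (column name and vector size are placeholders):

import numpy as np

ds.add_faiss_index(column="embeddings")

query = np.random.rand(768).astype("float32")                          # placeholder query vector
scores, indices = ds.search("embeddings", query, k=5)                  # raw row indices
scores, examples = ds.get_nearest_examples("embeddings", query, k=5)   # decoded rows
# search_batch / get_nearest_examples_batch take a 2D array of queries instead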
- datasets module functions (not Dataset methods):
    - concatenate_datasets
    - interleave_datasets (see the sketch after the example below)
from datasets import concatenate_datasets

# after the renames the two datasets share no column names, so concatenate them
# column-wise with axis=1 (both must have the same number of rows)
c = concatenate_datasets([
    a.rename_columns({k: f'old_{k}' for k in a.column_names}),
    b.rename_columns({k: f'new_{k}' for k in b.column_names}),
], axis=1)
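interleave_datasets mixes rows from several datasets instead of stacking them; a small sketch (probabilities and seed are illustrative), assuming a and b share the same columns:

from datasets import interleave_datasets

mixed = interleave_datasets([a, b], probabilities=[0.7, 0.3], seed=42)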
- see https://colab.research.google.com/drive/1jCLv31Y4cDfqD0lhO0AnqEv3Or-LLvWe?usp=sharing#scrollTo=dOB57NHroLxx
- docs on batch mapping https://huggingface.co/docs/datasets/en/about_map_batch
# map as a reduce: the batched function only mutates acc as a side effect and
# returns nothing, so the dataset itself is left unchanged
acc = {"max": 0}

def max_(col):
    acc["max"] = max([acc["max"]] + [len(text) for text in col])

# this won't work with multiprocessing (num_proc > 1), each worker gets its own copy of acc;
# for that see the linked colab https://colab.research.google.com/drive/1jCLv31Y4cDfqD0lhO0AnqEv3Or-LLvWe?usp=sharing#scrollTo=dOB57NHroLxx
ds.map(max_, input_columns="text", batched=True)
acc["max"]  # length of the longest "text" in the dataset
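The batch-mapping docs above also cover the other half of the note on map: a batched function can return a different number of rows than it received, as long as every returned column has the new length. A sketch that splits each "text" into chunks (one input row becomes several output rows; names are placeholders):

def chunk(batch):
    # one output row per "." chunk of each input text
    return {"chunk": [s for text in batch["text"] for s in text.split(".")]}

# the original columns must be dropped because their length no longer matches
chunks = ds.map(chunk, batched=True, remove_columns=ds.column_names)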