s-smits · December 20, 2024 11:47
diff --git a/gistfile1.txt b/gistfile1.txt
 <finetuning_example>Let’s go ahead and download DistilBERT using the AutoModelForMaskedLM class:

 Copied
 from transformers import AutoModelForMaskedLM

 model_checkpoint = "distilbert-base-uncased"
 model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
 We can see how many parameters this model has by calling the num_parameters() method:

 Copied
 distilbert_num_parameters = model.num_parameters() / 1_000_000
 print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
 print(f"'>>> BERT number of parameters: 110M'")
 Copied
 '>>> DistilBERT number of parameters: 67M'
 '>>> BERT number of parameters: 110M'
 With around 67 million parameters, DistilBERT is approximately two times smaller than the BERT base model, which roughly translates into a two-fold speedup in training — nice! Let’s now see what kinds of tokens this model predicts are the most likely completions of a small sample of text:

 Copied
 text = "This is a great [MASK]."
 As humans, we can imagine many possibilities for the [MASK] token, such as “day”, “ride”, or “painting”. For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. Like BERT, DistilBERT was pretrained on the English Wikipedia and BookCorpus datasets, so we expect the predictions for [MASK] to reflect these domains. To predict the mask we need DistilBERT’s tokenizer to produce the inputs for the model, so let’s download that from the Hub as well:

 Copied
 from transformers import AutoTokenizer

 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
 With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

 Copied
 import torch

 inputs = tokenizer(text, return_tensors="pt")
 token_logits = model(**inputs).logits

 Find the location of [MASK] and extract its logits

 mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
 mask_token_logits = token_logits[0, mask_token_index, :]

 Pick the [MASK] candidates with the highest logits

 top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

 for token in top_5_tokens:
 print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
 Copied
 '>>> This is a great deal.'
 '>>> This is a great success.'
 '>>> This is a great adventure.'
 '>>> This is a great idea.'
 '>>> This is a great feat.'
 We can see from the outputs that the model’s predictions refer to everyday terms, which is perhaps not surprising given the foundation of English Wikipedia. Let’s see how we can change this domain to something a bit more niche — highly polarized movie reviews!

 The dataset
 To showcase domain adaptation, we’ll use the famous Large Movie Review Dataset (or IMDb for short), which is a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model will adapt its vocabulary from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews. We can get the data from the Hugging Face Hub with the load_dataset() function from 🤗 Datasets:

 Copied
 from datasets import load_dataset

 imdb_dataset = load_dataset("imdb")
 imdb_dataset
 Copied
 DatasetDict({
 train: Dataset({
 features: ['text', 'label'],
 num_rows: 25000
 })
 test: Dataset({
 features: ['text', 'label'],
 num_rows: 25000
 })
 unsupervised: Dataset({
 features: ['text', 'label'],
 num_rows: 50000
 })
 })
 We can see that the train and test splits each consist of 25,000 reviews, while there is an unlabeled split called unsupervised that contains 50,000 reviews. Let’s take a look at a few samples to get an idea of what kind of text we’re dealing with. As we’ve done in previous chapters of the course, we’ll chain the Dataset.shuffle() and Dataset.select() functions to create a random sample:

 Copied
 sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

 for row in sample:
 print(f"\n'>>> Review: {row['text']}'")
 print(f"'>>> Label: {row['label']}'")
 Copied

 '>>> Review: This is your typical Priyadarshan movie--a bunch of loony characters out on some silly mission. His signature climax has the entire cast of the film coming together and fighting each other in some crazy moshpit over hidden money. Whether it is a winning lottery ticket in Malamaal Weekly, black money in Hera Pheri, "kodokoo" in Phir Hera Pheri, etc., etc., the director is becoming ridiculously predictable. Don't get me wrong; as clichéd and preposterous his movies may be, I usually end up enjoying the comedy. However, in most his previous movies there has actually been some good humor, (Hungama and Hera Pheri being noteworthy ones). Now, the hilarity of his films is fading as he is using the same formula over and over again.<br /><br />Songs are good. Tanushree Datta looks awesome. Rajpal Yadav is irritating, and Tusshar is not a whole lot better. Kunal Khemu is OK, and Sharman Joshi is the best.'
 '>>> Label: 0'

 '>>> Review: Okay, the story makes no sense, the characters lack any dimensionally, the best dialogue is ad-libs about the low quality of movie, the cinematography is dismal, and only editing saves a bit of the muddle, but Sam" Peckinpah directed the film. Somehow, his direction is not enough. For those who appreciate Peckinpah and his great work, this movie is a disappointment. Even a great cast cannot redeem the time the viewer wastes with this minimal effort.<br /><br />The proper response to the movie is the contempt that the director San Peckinpah, James Caan, Robert Duvall, Burt Young, Bo Hopkins, Arthur Hill, and even Gig Young bring to their work. Watch the great Peckinpah films. Skip this mess.'
 '>>> Label: 0'

 '>>> Review: I saw this movie at the theaters when I was about 6 or 7 years old. I loved it then, and have recently come to own a VHS version. <br /><br />My 4 and 6 year old children love this movie and have been asking again and again to watch it. <br /><br />I have enjoyed watching it again too. Though I have to admit it is not as good on a little TV.<br /><br />I do not have older children so I do not know what they would think of it. <br /><br />The songs are very cute. My daughter keeps singing them over and over.<br /><br />Hope this helps.'
 '>>> Label: 1'
 Yep, these are certainly movie reviews, and if you’re old enough you may even understand the comment in the last review about owning a VHS version 😜! Although we won’t need the labels for language modeling, we can already see that a 0 denotes a negative review, while a 1 corresponds to a positive one.

 ✏️ Try it out! Create a random sample of the unsupervised split and verify that the labels are neither 0 nor 1. While you’re at it, you could also check that the labels in the train and test splits are indeed 0 or 1 — this is a useful sanity check that every NLP practitioner should perform at the start of a new project!

 Now that we’ve had a quick look at the data, let’s dive into preparing it for masked language modeling. As we’ll see, there are some additional steps that one needs to take compared to the sequence classification tasks we saw in Chapter 3. Let’s go!

 Preprocessing the data

 For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? The reason is that individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!

 So to get started, we’ll first tokenize our corpus as usual, but without setting the truncation=True option in our tokenizer. We’ll also grab the word IDs if they are available ((which they will be if we’re using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking. We’ll wrap this in a simple function, and while we’re at it we’ll remove the text and label columns since we don’t need them any longer:

 Copied
 def tokenize_function(examples):
 result = tokenizer(examples["text"])
 if tokenizer.is_fast:
 result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
 return result

 Use batched=True to activate fast multithreading!

 tokenized_datasets = imdb_dataset.map(
 tokenize_function, batched=True, remove_columns=["text", "label"]
 )
 tokenized_datasets
 Copied
 DatasetDict({
 train: Dataset({
 features: ['attention_mask', 'input_ids', 'word_ids'],
 num_rows: 25000
 })
 test: Dataset({
 features: ['attention_mask', 'input_ids', 'word_ids'],
 num_rows: 25000
 })
 unsupervised: Dataset({
 features: ['attention_mask', 'input_ids', 'word_ids'],
 num_rows: 50000
 })
 })
 Since DistilBERT is a BERT-like model, we can see that the encoded texts consist of the input_ids and attention_mask that we’ve seen in other chapters, as well as the word_ids we added.

 Now that we’ve tokenized our movie reviews, the next step is to group them all together and split the result into chunks. But how big should these chunks be? This will ultimately be determined by the amount of GPU memory that you have available, but a good starting point is to see what the model’s maximum context size is. This can be inferred by inspecting the model_max_length attribute of the tokenizer:

 Copied
 tokenizer.model_max_length
 Copied
 512
 This value is derived from the tokenizer_config.json file associated with a checkpoint; in this case we can see that the context size is 512 tokens, just like with BERT.

 ✏️ Try it out! Some Transformer models, like BigBird and Longformer, have a much longer context length than BERT and other early Transformer models. Instantiate the tokenizer for one of these checkpoints and verify that the model_max_length agrees with what’s quoted on its model card.

 So, in order to run our experiments on GPUs like those found on Google Colab, we’ll pick something a bit smaller that can fit in memory:

 Copied
 chunk_size = 128
 Note that using a small chunk size can be detrimental in real-world scenarios, so you should use a size that corresponds to the use case you will apply your model to.

 Now comes the fun part. To show how the concatenation works, let’s take a few reviews from our tokenized training set and print out the number of tokens per review:

 Copied

 Slicing produces a list of lists for each feature

 tokenized_samples = tokenized_datasets["train"][:3]

 for idx, sample in enumerate(tokenized_samples["input_ids"]):
 print(f"'>>> Review {idx} length: {len(sample)}'")
 Copied
 '>>> Review 0 length: 200'
 '>>> Review 1 length: 559'
 '>>> Review 2 length: 192'
 We can then concatenate all these examples with a simple dictionary comprehension, as follows:

 Copied
 concatenated_examples = {
 k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
 }
 total_length = len(concatenated_examples["input_ids"])
 print(f"'>>> Concatenated reviews length: {total_length}'")
 Copied
 '>>> Concatenated reviews length: 951'
 Great, the total length checks out — so now let’s split the concatenated reviews into chunks of the size given by chunk_size. To do so, we iterate over the features in concatenated_examples and use a list comprehension to create slices of each feature. The result is a dictionary of chunks for each feature:

 Copied
 chunks = {
 k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
 for k, t in concatenated_examples.items()
 }

 for chunk in chunks["input_ids"]:
 print(f"'>>> Chunk length: {len(chunk)}'")
 Copied
 '>>> Chunk length: 128'
 '>>> Chunk length: 128'
 '>>> Chunk length: 128'
 '>>> Chunk length: 128'
 '>>> Chunk length: 128'
 '>>> Chunk length: 128'
 '>>> Chunk length: 128'
 '>>> Chunk length: 55'
 As you can see in this example, the last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:

 Drop the last chunk if it’s smaller than chunk_size.
 Pad the last chunk until its length equals chunk_size.
 We’ll take the first approach here, so let’s wrap all of the above logic in a single function that we can apply to our tokenized datasets:

 Copied
 def group_texts(examples):
 # Concatenate all texts
 concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
 # Compute length of concatenated texts
 total_length = len(concatenated_examples[list(examples.keys())[0]])
 # We drop the last chunk if it's smaller than chunk_size
 total_length = (total_length // chunk_size) * chunk_size
 # Split by chunks of max_len
 result = {
 k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
 for k, t in concatenated_examples.items()
 }
 # Create a new labels column
 result["labels"] = result["input_ids"].copy()
 return result
 Note that in the last step of group_texts() we create a new labels column which is a copy of the input_ids one. As we’ll see shortly, that’s because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.

 Let’s now apply group_texts() to our tokenized datasets using our trusty Dataset.map() function:

 Copied
 lm_datasets = tokenized_datasets.map(group_texts, batched=True)
 lm_datasets
 Copied
 DatasetDict({
 train: Dataset({
 features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
 num_rows: 61289
 })
 test: Dataset({
 features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
 num_rows: 59905
 })
 unsupervised: Dataset({
 features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
 num_rows: 122963
 })
 })
 You can see that grouping and then chunking the texts has produced many more examples than our original 25,000 for the train and test splits. That’s because we now have examples involving contiguous tokens that span across multiple examples from the original corpus. You can see this explicitly by looking for the special [SEP] and [CLS] tokens in one of the chunks:

 Copied
 tokenizer.decode(lm_datasets["train"][1]["input_ids"])
 Copied
 ".... at.......... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless"
 In this example you can see two overlapping movie reviews, one about a high school movie and the other about homelessness. Let’s also check out what the labels look like for masked language modeling:

 Copied
 tokenizer.decode(lm_datasets["train"][1]["labels"])
 Copied
 ".... at.......... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless"
 As expected from our group_texts() function above, this looks identical to the decoded input_ids — but then how can our model possibly learn anything? We’re missing a key step: inserting [MASK] tokens at random positions in the inputs! Let’s see how we can do this on the fly during fine-tuning using a special data collator.

 Fine-tuning DistilBERT with the Trainer API
 Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model, like we did in Chapter 3. The only difference is that we need a special data collator that can randomly mask some of the tokens in each batch of texts. Fortunately, 🤗 Transformers comes prepared with a dedicated DataCollatorForLanguageModeling for just this task. We just have to pass it the tokenizer and an mlm_probability argument that specifies what fraction of the tokens to mask. We’ll pick 15%, which is the amount used for BERT and a common choice in the literature:

 Copied
 from transformers import DataCollatorForLanguageModeling

 data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
 To see how the random masking works, let’s feed a few examples to the data collator. Since it expects a list of dicts, where each dict represents a single chunk of contiguous text, we first iterate over the dataset before feeding the batch to the collator. We remove the "word_ids" key for this data collator as it does not expect it:

 Copied
 samples = [lm_datasets["train"][i] for i in range(2)]
 for sample in samples:
 _ = sample.pop("word_ids")

 for chunk in data_collator(samples)["input_ids"]:
 print(f"\n'>>> {tokenizer.decode(chunk)}'")
 Copied
 '>>> [CLS] bromwell [MASK] is a cartoon comedy. it ran at the same [MASK] as some other [MASK] about school life, [MASK] as " teachers ". [MASK] [MASK] [MASK] in the teaching [MASK] lead [MASK] to believe that bromwell high'[MASK] satire is much closer to reality than is " teachers ". the scramble [MASK] [MASK] financially, the [MASK]ful students whogn [MASK] right through [MASK] pathetic teachers'pomp, the pettiness of the whole situation, distinction remind me of the schools i knew and their students. when i saw [MASK] episode in [MASK] a student repeatedly tried to burn down the school, [MASK] immediately recalled. [MASK]...'

 '>>> .... at.. [MASK]... [MASK]... high. a classic line plucked inspector : i'[MASK] here to [MASK] one of your [MASK]. student : welcome to bromwell [MASK]. i expect that many adults of my age think that [MASK]mwell [MASK] is [MASK] fetched. what a pity that it isn't! [SEP] [CLS] [MASK]ness ( or [MASK]lessness as george 宇in stated )公 been an issue for years but never [MASK] plan to help those on the street that were once considered human [MASK] did everything from going to school, [MASK], [MASK] vote for the matter. most people think [MASK] the homeless'
 Nice, it worked! We can see that the [MASK] token has been randomly inserted at various locations in our text. These will be the tokens which our model will have to predict during training — and the beauty of the data collator is that it will randomize the [MASK] insertion with every batch!

 ✏️ Try it out! Run the code snippet above several times to see the random masking happen in front of your very eyes! Also replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that sometimes a single token from a given word is masked, and not the others.

 One side effect of random masking is that our evaluation metrics will not be deterministic when using the Trainer, since we use the same data collator for the training and test sets. We’ll see later, when we look at fine-tuning with 🤗 Accelerate, how we can use the flexibility of a custom evaluation loop to freeze the randomness.

 When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we will need to build a data collator ourselves. A data collator is just a function that takes a list of samples and converts them into a batch, so let’s do this now! We’ll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs. Note that the labels are all -100 except for the ones corresponding to mask words.

 Copied
 import collections
 import numpy as np

 from transformers import default_data_collator

 wwm_probability = 0.2

 def whole_word_masking_data_collator(features):
 for feature in features:
 word_ids = feature.pop("word_ids")

 # Create a map between words and corresponding token indices
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)

    # Randomly mask words
    mask = np.random.binomial(1, wwm_probability, (len(mapping),))
    input_ids = feature["input_ids"]
    labels = feature["labels"]
    new_labels = [-100] * len(labels)
    for word_id in np.where(mask)[0]:
        word_id = word_id.item()
        for idx in mapping[word_id]:
            new_labels[idx] = labels[idx]
            input_ids[idx] = tokenizer.mask_token_id
    feature["labels"] = new_labels

 return default_data_collator(features)
 content_copy
 download
 Use code with caution.

 Next, we can try it on the same samples as before:

 Copied
 samples = [lm_datasets["train"][i] for i in range(2)]
 batch = whole_word_masking_data_collator(samples)

 for chunk in batch["input_ids"]:
 print(f"\n'>>> {tokenizer.decode(chunk)}'")
 Copied
 '>>> [CLS] bromwell high is a cartoon comedy [MASK] it ran at the same time as some other programs about school life, such as " teachers ". my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is " teachers ". the scramble to survive financially, the insightful students who can see right through their pathetic teachers'pomp, the pettiness of the whole situation, all remind me of the schools i knew and their students. when i saw the episode in which a student repeatedly tried to burn down the school, i immediately recalled.....'

 '>>> .... [MASK] [MASK] [MASK] [MASK]....... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless'
 ✏️ Try it out! Run the code snippet above several times to see the random masking happen in front of your very eyes! Also replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that the tokens from a given word are always masked together.

 Now that we have two data collators, the rest of the fine-tuning steps are standard. Training can take a while on Google Colab if you’re not lucky enough to score a mythical P100 GPU 😭, so we’ll first downsample the size of the training set to a few thousand examples. Don’t worry, we’ll still get a pretty decent language model! A quick way to downsample a dataset in 🤗 Datasets is via the Dataset.train_test_split() function that we saw in Chapter 5:

 Copied
 train_size = 10_000
 test_size = int(0.1 * train_size)

 downsampled_dataset = lm_datasets["train"].train_test_split(
 train_size=train_size, test_size=test_size, seed=42
 )
 downsampled_dataset
 Copied
 DatasetDict({
 train: Dataset({
 features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
 num_rows: 10000
 })
 test: Dataset({
 features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
 num_rows: 1000
 })
 })
 This has automatically created new train and test splits, with the training set size set to 10,000 examples and the validation set to 10% of that — feel free to increase this if you have a beefy GPU! The next thing we need to do is log in to the Hugging Face Hub. If you’re running this code in a notebook, you can do so with the following utility function:

 Copied
 from huggingface_hub import notebook_login

 notebook_login()
 which will display a widget where you can enter your credentials. Alternatively, you can run:

 Copied
 huggingface-cli login
 in your favorite terminal and log in there.

 Once we’re logged in, we can specify the arguments for the Trainer:

 Copied
 from transformers import TrainingArguments

 batch_size = 64

 Show the training loss with every epoch

 logging_steps = len(downsampled_dataset["train"]) // batch_size
 model_name = model_checkpoint.split("/")[-1]

 training_args = TrainingArguments(
 output_dir=f"{model_name}-finetuned-imdb",
 overwrite_output_dir=True,
 evaluation_strategy="epoch",
 learning_rate=2e-5,
 weight_decay=0.01,
 per_device_train_batch_size=batch_size,
 per_device_eval_batch_size=batch_size,
 push_to_hub=True,
 fp16=True,
 logging_steps=logging_steps,
 )
 Here we tweaked a few of the default options, including logging_steps to ensure we track the training loss with each epoch. We’ve also used fp16=True to enable mixed-precision training, which gives us another boost in speed. By default, the Trainer will remove any columns that are not part of the model’s forward() method. This means that if you’re using the whole word masking collator, you’ll also need to set remove_unused_columns=False to ensure we don’t lose the word_ids column during training.

 Note that you can specify the name of the repository you want to push to with the hub_model_id argument (in particular, you will have to use this argument to push to an organization). For instance, when we pushed the model to the huggingface-course organization, we added hub_model_id="huggingface-course/distilbert-finetuned-imdb" to TrainingArguments. By default, the repository used will be in your namespace and named after the output directory you set, so in our case it will be "lewtun/distilbert-finetuned-imdb".

 We now have all the ingredients to instantiate the Trainer. Here we just use the standard data_collator, but you can try the whole word masking collator and compare the results as an exercise:

 Copied
 from transformers import Trainer

 trainer = Trainer(
 model=model,
 args=training_args,
 train_dataset=downsampled_dataset["train"],
 eval_dataset=downsampled_dataset["test"],
 data_collator=data_collator,
 tokenizer=tokenizer,
 )
 We’re now ready to run trainer.train() — but before doing so let’s briefly look at perplexity, which is a common metric to evaluate the performance of language models.

 Perplexity for language models

 Unlike other tasks like text classification or question answering where we’re given a labeled corpus to train on, with language modeling we don’t have any explicit labels. So how do we determine what makes a good language model? Like with the autocorrect feature in your phone, a good language model is one that assigns high probabilities to sentences that are grammatically correct, and low probabilities to nonsense sentences. To give you a better idea of what this looks like, you can find whole sets of “autocorrect fails” online, where the model in a person’s phone has produced some rather funny (and often inappropriate) completions!

 Assuming our test set consists mostly of sentences that are grammatically correct, then one way to measure the quality of our language model is to calculate the probabilities it assigns to the next word in all the sentences of the test set. High probabilities indicates that the model is not “surprised” or “perplexed” by the unseen examples, and suggests it has learned the basic patterns of grammar in the language. There are various mathematical definitions of perplexity, but the one we’ll use defines it as the exponential of the cross-entropy loss. Thus, we can calculate the perplexity of our pretrained model by using the Trainer.evaluate() function to compute the cross-entropy loss on the test set and then taking the exponential of the result:

 Copied
 import math

 eval_results = trainer.evaluate()
 print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
 Copied

 Perplexity: 21.75
 A lower perplexity score means a better language model, and we can see here that our starting model has a somewhat large value. Let’s see if we can lower it by fine-tuning! To do that, we first run the training loop:

 Copied
 trainer.train()
 and then compute the resulting perplexity on the test set as before:

 Copied
 eval_results = trainer.evaluate()
 print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
 Copied

 Perplexity: 11.32
 Nice — this is quite a reduction in perplexity, which tells us the model has learned something about the domain of movie reviews!
 </finetuning_example>

 <model_to_finetune>
 ModernBERT
 Table of Contents
 Model Summary
 Usage
 Evaluation
 Limitations
 Training
 License
 Citation
 Model Summary
 ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. ModernBERT leverages recent architectural improvements such as:

 Rotary Positional Embeddings (RoPE) for long-context support.
 Local-Global Alternating Attention for efficiency on long inputs.
 Unpadding and Flash Attention for efficient inference.
 ModernBERT’s native long context length makes it ideal for tasks that require processing long documents, such as retrieval, classification, and semantic search within large corpora. The model was trained on a large corpus of text and code, making it suitable for a wide range of downstream tasks, including code retrieval and hybrid (text + code) semantic search.

 It is available in the following sizes:

 ModernBERT-base - 22 layers, 149 million parameters
 ModernBERT-large - 28 layers, 395 million parameters
 For more information about ModernBERT, we recommend our release blog post for a high-level overview, and our arXiv pre-print for in-depth information.

 ModernBERT is a collaboration between Answer.AI, LightOn, and friends.

 Usage
 You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from main:

 pip install git+https://github.com/huggingface/transformers.git

 Since ModernBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.

 ⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

 pip install flash-attn

 Using AutoModelForMaskedLM:

 from transformers import AutoTokenizer, AutoModelForMaskedLM

 model_id = "answerdotai/ModernBERT-base"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForMaskedLM.from_pretrained(model_id)

 text = "The capital of France is [MASK]."
 inputs = tokenizer(text, return_tensors="pt")
 outputs = model(**inputs)

 To get predictions for the mask:

 masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
 predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
 predicted_token = tokenizer.decode(predicted_token_id)
 print("Predicted token:", predicted_token)

 Predicted token:  Paris

 Using a pipeline:

 import torch
 from transformers import pipeline
 from pprint import pprint

 pipe = pipeline(
 "fill-mask",
 model="answerdotai/ModernBERT-base",
 torch_dtype=torch.bfloat16,
 )

 input_text = "He walked to the [MASK]."
 results = pipe(input_text)
 pprint(results)

 Note: ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the token_type_ids parameter.

 Evaluation
 We evaluate ModernBERT across a range of tasks, including natural language understanding (GLUE), general retrieval (BEIR), long-context retrieval (MLDR), and code retrieval (CodeSearchNet and StackQA).

 Key highlights:

 On GLUE, ModernBERT-base surpasses other similarly-sized encoder models, and ModernBERT-large is second only to Deberta-v3-large.
 For general retrieval tasks, ModernBERT performs well on BEIR in both single-vector (DPR-style) and multi-vector (ColBERT-style) settings.
 Thanks to the inclusion of code data in its training mixture, ModernBERT as a backbone also achieves new state-of-the-art code retrieval results on CodeSearchNet and StackQA.
 Base Models
 Model	IR (DPR)	IR (DPR)	IR (DPR)	IR (ColBERT)	IR (ColBERT)	NLU	Code	Code
 BEIR	MLDR_OOD	MLDR_ID	BEIR	MLDR_OOD	GLUE	CSN	SQA
 BERT	38.9	23.9	32.2	49.0	28.1	84.7	41.2	59.5
 RoBERTa	37.7	22.9	32.8	48.7	28.2	86.4	44.3	59.6
 DeBERTaV3	20.2	5.4	13.4	47.1	21.9	88.1	17.5	18.6
 NomicBERT	41.0	26.7	30.3	49.9	61.3	84.0	41.6	61.4
 GTE-en-MLM	41.4	34.3	44.4	48.2	69.3	85.6	44.9	71.4
 ModernBERT	41.6	27.4	44.0	51.3	80.2	88.4	56.4	73.6
 Large Models
 Model	IR (DPR)	IR (DPR)	IR (DPR)	IR (ColBERT)	IR (ColBERT)	NLU	Code	Code
 BEIR	MLDR_OOD	MLDR_ID	BEIR	MLDR_OOD	GLUE	CSN	SQA
 BERT	38.9	23.3	31.7	49.5	28.5	85.2	41.6	60.8
 RoBERTa	41.4	22.6	36.1	49.8	28.8	88.9	47.3	68.1
 DeBERTaV3	25.6	7.1	19.2	46.7	23.0	91.4	21.2	19.7
 GTE-en-MLM	42.5	36.4	48.9	50.7	71.3	87.6	40.5	66.9
 ModernBERT	44.0	34.3	48.6	52.4	80.4	90.4	59.5	83.9
 Table 1: Results for all models across an overview of all tasks. CSN refers to CodeSearchNet and SQA to StackQA. MLDRID refers to in-domain (fine-tuned on the training set) evaluation, and MLDR_OOD to out-of-domain.

 ModernBERT’s strong results, coupled with its efficient runtime on long-context inputs, demonstrate that encoder-only models can be significantly improved through modern architectural choices and extensive pretraining on diversified data sources.

 Limitations
 ModernBERT’s training data is primarily English and code, so performance may be lower for other languages. While it can handle long sequences efficiently, using the full 8,192 tokens window may be slower than short-context inference. Like any large language model, ModernBERT may produce representations that reflect biases present in its training data. Verify critical or sensitive outputs before relying on them.

 Training
 Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
 Sequence Length: Pre-trained up to 1,024 tokens, then extended to 8,192 tokens.
 Data: 2 trillion tokens of English text and code.
 Optimizer: StableAdamW with trapezoidal LR scheduling and 1-sqrt decay.
 Hardware: Trained on 8x H100 GPUs.
 See the paper for more details.
 <model_to_finetune>

 <data_to_use>
 Datasets:

 ssmits
 /
 fineweb-2-dutch

 like
 0
 Modalities:
 Tabular
 Text
 Formats:
 parquet
 Size:
 10M - 100M
 Libraries:
 Datasets
 Dask

 Croissant

 1
 Dataset card
 Viewer
 Files
 Community
 1
 Dataset Viewer
 Auto-converted to Parquet
 API
 Embed
 Full Screen Viewer
 Split (1)
 train
 ·
 83.5M rows

 train (83.5M rows)
 Search is not available for this dataset

 SQL
 Console
 text
 stringlengths
 22063.4k99.9%
 id
 stringlengths
 47100%
 dump
 stringclasses
 CC-MAIN-2013-200.4%
 url
 stringlengths
 1424499.9%
 date
 stringlengths
 20100%
 file_path
 stringlengths
 1371413.6%
 language
 stringclasses
 nld100%
 language_score
 float64
 0.720.75<0.1%
 language_script
 stringclasses
 Latn100%
 minhash_cluster_size
 int64
 1380k100%
 top_langs
 stringlengths
 831360.2%
 Text here
 <data_to_use>

 <prompt>Modify the finetuning script in order to fine-tune ModernBERT on fineweb-2-dutch; analyze if data should be packed or not, and what if its >8192 tokens then split. But maybe its not good if two different subjects are collated together or is it? Do the embeddings get 'confused' or is this generally more efficient. Save every 1 million tokens</prompt>