An easy-to-follow guide to fine-tuning your first HuggingFace model
The HuggingFace Model Hub is a warehouse of state-of-the-art machine learning models for NLP, image, and audio tasks. A massive community fine-tunes these models downstream to fit specific use cases, and developers with limited domain knowledge in ML can leverage them in their projects through an API that abstracts the process every step of the way.
In this example, we shall fine-tune our own summarizer with a few lines of code. This guide assumes a basic understanding of NLP concepts such as tokenization, and familiarity with deep-learning terms such as fine-tuning. The code is present here.
Libraries
First, install the libraries: transformers, datasets, rouge_score, and tensorflow or pytorch.
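For example, with pip (swap tensorflow for torch if you prefer the PyTorch path shown later):

```
pip install transformers datasets rouge_score tensorflow
```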
Dataset Gathering
To start, we need a dataset on which to train our model to perform summarization. In this instance we’ll use the ‘arXiv Dataset’ from Kaggle, linked here. The dataset consists of several fields, but we are concerned with only two: ‘title’ and ‘abstract’. Given the abstract of a paper, our model should generate a one-line summary, its title.
Download it and copy its path (for example: /kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json). As you may have noticed, our data is in JSON format, which may seem a hassle, but fortunately the ‘load_dataset’ function from the ‘datasets’ library parses the JSON** into an Arrow table. The datasets library is also developed by HuggingFace and comes preloaded with several datasets for a variety of tasks; read more here.
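A minimal sketch of the loading step (the path is the example above; replace it with wherever you saved the file):

```python
from datasets import load_dataset

# Example path from the Kaggle download; replace with your own.
data_path = "/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json"

# load_dataset parses the JSON into an Arrow-backed Dataset.
raw_dataset = load_dataset("json", data_files=data_path, split="train")
print(raw_dataset)
```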
Tokenize and Preprocess
We can additionally perform a train-test split on our dataset using the split parameter of load_dataset (or the train_test_split method, as sketched below). Our next objective is to preprocess the text: we tokenize the ‘abstract’ field as our inputs and the ‘title’ field as our labels. Before that, let’s understand the AutoTokenizer class. It performs preprocessing such as tokenization, normalization, conversion of tokens to input IDs, and other operations specific to each model. It converts the text into a numeric form relevant to the model, which the model further processes to predict the correct label or generate text. It also adds an attention mask, where ones indicate tokens to attend to and zeros indicate tokens to ignore.
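A quick sketch of the split, continuing from the raw_dataset loaded above (the 90/10 ratio is an arbitrary assumption):

```python
# Split into train and validation sets; test_size=0.1 is an assumption.
splits = raw_dataset.train_test_split(test_size=0.1, seed=42)
trains, vals = splits["train"], splits["test"]
```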
We instantiate a tokenizer with the AutoTokenizer.from_pretrained() method, passing the name of the model we will use for summarization, in our case ‘t5-small’.

We now define a preprocess function. It prefixes ‘summarize: ‘ to our abstracts and assigns the result to our inputs variable; this is done to prompt our ‘t5-small’ model that this is a summarization task. To the tokenizer object we pass the inputs along with keyword arguments: max_length caps sequences at 1024 tokens, and truncation = True slices documents that exceed that length. Shorter documents can be padded up to the same length, though we’ll see a more efficient, dynamic way to pad in the next section.

Next, we enter the context manager set up by the as_target_tokenizer() method, which temporarily configures the tokenizer to encode the targets. This is particularly useful for Seq2Seq models (sequence to sequence: the input is a sequence of text, and the model generates another sequence of text as output), which need slightly different preprocessing for the labels. We then assign the tokenized labels, i.e., labels[‘input_ids’], to the ‘labels’ field of the tokenized inputs, since this is the form in which our model accepts inputs and labels. Finally, we apply our preprocess function to the trains and vals datasets using the map method, setting the batched keyword argument to True so that data is provided to the preprocess function in batches.
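Putting the above together as a sketch (max_target_length of 128 for the titles is an assumption, not from the original):

```python
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

prefix = "summarize: "
max_input_length = 1024
max_target_length = 128  # assumption: one-line titles fit well within 128 tokens

def preprocess(examples):
    # Prefix each abstract so T5 treats this as a summarization task.
    inputs = [prefix + abstract for abstract in examples["abstract"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Temporarily switch the tokenizer to target mode to encode the labels.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["title"], max_length=max_target_length, truncation=True
        )

    # The model expects the tokenized titles under the 'labels' key.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_trains = trains.map(preprocess, batched=True)
tokenized_vals = vals.map(preprocess, batched=True)
```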
Data Collation
We use a DataCollator (in our case DataCollatorForSeq2Seq) object to create batches of data; it also dynamically pads the sequences in each batch (enabled by default) to the length of the longest token array in that batch. The tokenizer can pad the data too, but that is less efficient than dynamic padding. We set return_tensors to ‘tf’ for TensorFlow; for PyTorch, omit the return_tensors argument or set it to ‘pt’. Now convert the train and val datasets to the tf.data.Dataset format using the to_tf_dataset method, specifying the required columns (‘attention_mask’, ‘input_ids’, ‘labels’), the batch_size, shuffle (True or False), and the collator object.
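A sketch of the collation step. Note that the collator is commonly given the model object so it can prepare decoder_input_ids for Seq2Seq training, so the model (discussed in detail in the next section) is instantiated here; the batch size is an assumption:

```python
from transformers import DataCollatorForSeq2Seq, TFAutoModelForSeq2SeqLM

# Instantiated here so the collator can use it to prepare decoder inputs;
# the next section covers the model itself.
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

batch_size = 8  # assumption: adjust to your hardware

tf_train_set = tokenized_trains.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_val_set = tokenized_vals.to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
```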
Fit and Train
Now that we’ve made all the required transformations to feed the data to a TensorFlow model, we instantiate our pre-trained model from the TFAutoModelForSeq2SeqLM class, passing the model’s name (t5-small) to the from_pretrained function to inherit the config and weights of the pre-trained model. We then instantiate an AdamWeightDecay optimizer, setting the learning_rate and weight_decay_rate, call model.compile() with the optimizer as argument, and lastly call model.fit() with the train set, the val set, and the number of epochs.
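Continuing the sketch, with model, tf_train_set, and tf_val_set from the previous snippet (the hyperparameter values here are assumptions to tune for your setup):

```python
from transformers import AdamWeightDecay

# Assumed hyperparameters; adjust learning rate and decay as needed.
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

# No loss argument: the model computes its own loss from the labels.
model.compile(optimizer=optimizer)
model.fit(tf_train_set, validation_data=tf_val_set, epochs=3)
```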
For PyTorch, instantiate an AutoModelForSeq2SeqLM and pass it the name of the model. Set hyperparameters such as learning_rate, epochs, weight_decay, and the train and val batch sizes by instantiating a Seq2SeqTrainingArguments() object. Then instantiate a trainer from Seq2SeqTrainer(), passing the model, the training args, the train dataset, the val dataset, the tokenizer, and the data collator. Finally, call the trainer.train() function to begin training.
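A sketch of the equivalent PyTorch path (the output_dir name and hyperparameter values are assumptions):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # 'pt' by default

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-arxiv",  # hypothetical directory for checkpoints
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_trains,
    eval_dataset=tokenized_vals,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```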
Metrics
We consider a candidate collection of 100 abstracts from our dataset, pass it through the tokenizer, have the model generate output tokens from the input_ids, and decode the output tokens with the tokenizer, leaving out special tokens such as beginning-of-sentence and mask. To evaluate the performance of the model, we call the load_metric function with the ROUGE metric as parameter. We create an evaluation function that accepts the summaries generated by our model and the actual labels. The metric.compute() function takes the generated summaries and the actual labels and computes the ROUGE score, based on the stemmed form of the text if use_stemmer is True.
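A sketch of this evaluation loop, reusing vals, prefix, tokenizer, and the TF model from earlier (the generation max_length of 64 is an assumption):

```python
from datasets import load_metric

rouge = load_metric("rouge")

# A candidate collection of 100 abstracts from the validation set.
sample = vals.select(range(100))
inputs = tokenizer(
    [prefix + abstract for abstract in sample["abstract"]],
    max_length=max_input_length,
    truncation=True,
    padding=True,
    return_tensors="tf",
)

# Generate summary tokens from the input_ids.
output_ids = model.generate(
    inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=64
)

# Decode, dropping special tokens such as <pad> and </s>.
predictions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

scores = rouge.compute(
    predictions=predictions, references=sample["title"], use_stemmer=True
)
print(scores)
```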
Conclusion
We have thus created a summarizer model, and in a similar fashion you can create models for Text Classification, Token Classification, Question Answering, Language Modelling, Translation, etc. I urge you to follow the transformers documentation to try out more tasks and other models as well.
[** Note that in our case the JSON file contains several JSON objects instead of an array of JSON objects. In such cases, specify the field argument (read more)]