from IPython.display import HTML
HTML("""
<iframe
    src="https://mrdbourke-learn-hf-food-not-food-text-classifier-demo.hf.space"
    frameborder="0"
    width="850"
    height="650"
></iframe>
""")
Text Classification with Hugging Face Transformers Tutorial
Source code on GitHub | Online book version | Setup guide | Video Course (step by step walkthrough)
1 Overview
Welcome to the Learn Hugging Face Text Classification project!
This tutorial is hands-on and focused on writing reusable code.
We'll start with a text dataset, build a model to classify text samples and then share our model as a demo others can use.
To do so, we'll be using a handful of helpful open-source tools from the Hugging Face ecosystem.
Feel free to keep reading through the notebook, but if you'd like to run the code yourself, be sure to go through the setup guide first.
1.1 What we’re going to build
We're going to be building a `food`/`not_food` text classification model.
Given a piece of text (such as an image caption), our model will be able to predict if it's about food or not.
This is the same kind of model I use in my own work on Nutrify (an app to help people learn about food).
More specifically, we’re going to follow the following steps:
- Data: Problem definition and dataset preparation - Getting a dataset/setting up the problem space.
- Model: Finding, training and evaluating a model - Finding a text classification model suitable for our problem on Hugging Face and customizing it to our own dataset.
- Demo: Creating a demo and putting our model into the real world - Sharing our trained model in a way others can access and use.
By the end of this project, you’ll have a trained model and demo on Hugging Face you can share with others:
Note this is a hands-on project, so we'll be focused on writing reusable code and building a model that can be used in the real world. If you're looking for explanations of the theory behind what we're doing, I'll leave links in the extra-curriculum section.
1.2 What is text classification?
Text classification is the process of assigning a category to a piece of text.
Where a category can be almost anything and a piece of text can be a word, phrase, sentence, paragraph or entire document.
Example text classification problems include:
Problem | Description | Problem Type |
---|---|---|
Spam/phishing email detection | Is an email spam or not spam? Or is it a phishing email or not? | Binary classification (one thing or another) |
Sentiment analysis | Is a piece of text positive, negative or neutral? Such as classifying product reviews into good/bad/neutral. | Multi-class classification (one thing from many) |
Language detection | What language is a piece of text written in? | Multi-class classification (one thing from many) |
Topic classification | What topic(s) does a news article belong to? | Multi-label classification (one or more things from many) |
Hate speech detection | Is a comment hateful or not hateful? | Binary classification (one thing or another) |
Product categorization | What categories does a product belong to? | Multi-label classification (one or more things from many) |
Business email classification | Which category should this email go to? | Multi-class classification (one thing from many) |
Text classification is a very common problem in many business settings.
For example, a project I've worked on previously as a machine learning engineer was building a text classification model to classify different insurance claims into `claimant_at_fault`/`claimant_not_at_fault` for a large insurance company.
It turns out the deep learning-based model we built was very good (98%+ accuracy on the test dataset).
Speaking of models, there are several different kinds of models you can use for text classification.
And each will have its pros and cons depending on the problem you’re working on.
Example text classification models include:
Model | Description | Pros | Cons |
---|---|---|---|
Rule-based | Uses a set of rules to classify text (e.g. if text contains “sad” -> sentiment = low) | Simple, easy to understand | Requires manual creation of rules |
Bag of Words | Counts the frequency of words in a piece of text | Simple, easy to understand | Doesn’t capture word order |
TF-IDF | Weighs the importance of words in a piece of text | Simple, easy to understand | Doesn’t capture word order |
Deep learning-based models | Uses neural networks to learn patterns in text | Can learn complex patterns at scale | Can require large amounts of data/compute power to run, not as easy to understand (can be hard to debug) |
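To make the comparison concrete, here's a minimal sketch of the rule-based approach from the table above, applied to the food/not_food problem we're about to tackle (the keyword list is made up purely for illustration):

```python
# A tiny rule-based food/not_food classifier (illustrative only, the keyword list is an assumption)
FOOD_KEYWORDS = {"pizza", "curry", "cheese", "banana", "naan", "salad"}

def rule_based_classify(text: str) -> str:
    """Return 'food' if the text contains a known food keyword, otherwise 'not_food'."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    return "food" if words & FOOD_KEYWORDS else "not_food"

print(rule_based_classify("A slice of pizza with melted cheese"))          # food
print(rule_based_classify("Pair of reading glasses left open on a book"))  # not_food
```

Rules like this break as soon as a caption mentions a food the keyword list doesn't cover, which is exactly the kind of gap a model that learns from examples can fill.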
For our project, we’re going to go with a deep learning model.
Why?
Because the Hugging Face ecosystem makes it straightforward to do so.
And in most cases, with a quality dataset, a deep learning model will often perform better than a rule-based or other model.
1.3 Why train your own text classification models?
For text classification, you can customize (fine-tune) your own pre-trained models, or you can use API-powered models and LLMs such as GPT, Gemini, Claude or Mistral.
Depending on your requirements, there are several pros and cons for using your own model versus using an API.
Training/fine-tuning your own model:
Pros | Cons |
---|---|
Control: Full control over model lifecycle. | Can be complex to get set up. |
No usage limits (aside from compute constraints). | Requires dedicated compute resources for training/inference. |
Can train once and deploy everywhere/whenever you want (for example, Tesla deploying a model to all self-driving cars). | Requires maintenance over time to ensure performance remains up to par. |
Privacy: Data can be kept in-house/app and doesn’t need to go to a third party. | Can require longer development cycles compared to using existing APIs. |
Speed: Customizing a small model for a specific use case often means it runs much faster. |
Using a pre-built model API (e.g. GPT, Gemini, Claude, Mistral):
Pros | Cons |
---|---|
Ease of use: often can be set up within a few lines of code. | If the model API goes down, your service goes down. |
No maintenance of compute resources. | Data is required to be sent to a third-party for processing. |
Access to the most advanced models. | The API may have usage limits per day/time period. |
Can scale if usage increases. | Can be much slower than using dedicated models due to requiring an API call. |
For this project, we’re going to focus on fine-tuning our own model.
1.4 Workflow we’re going to follow
Our motto is data, model, demo!
So we’re going to follow the rough workflow of:
1. Create and preprocess data.
2. Define the model we'd like to use with `transformers.AutoModelForSequenceClassification` (or another similar model class).
3. Define training arguments (these are hyperparameters for our model) with `transformers.TrainingArguments`.
4. Pass the `TrainingArguments` from 3 and target datasets to an instance of `transformers.Trainer`.
5. Train the model by calling `Trainer.train()`.
6. Save the model (to our local machine or to the Hugging Face Hub).
7. Evaluate the trained model by making and inspecting predictions on the test data.
8. Turn the model into a shareable demo.
I say rough because machine learning projects are often non-linear in nature.
As in, because machine learning projects involve many experiments, they can kind of be all over the place.
But this workflow will give us some good guidelines to follow.
2 Importing necessary libraries
Let’s get started!
First, we’ll import the required libraries.
If you’re running on your local computer, be sure to check out the getting setup guide to make sure you have everything you need.
If you're using Google Colab, many of the following libraries will be installed by default.
However, we’ll have to install a few extras to get everything working.
If you're running on Google Colab, this notebook will work best with access to a GPU. To enable a GPU, go to `Runtime` ➡️ `Change runtime type` ➡️ `Hardware accelerator` ➡️ `GPU`.
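Once the runtime is set, a quick optional check (assuming PyTorch is already installed, as it is on Colab) confirms whether an accelerator is visible:

```python
import torch

# Check which accelerator (if any) PyTorch can see
if torch.cuda.is_available():
    print(f"Using CUDA GPU: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    print("Using Apple Silicon MPS device")
else:
    print("No accelerator found, using CPU")
```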
We'll need to install the following libraries from the Hugging Face ecosystem:

- `transformers` - comes pre-installed on Google Colab but if you're running on your local machine, you can install it via `pip install transformers`.
- `datasets` - a library for accessing and manipulating datasets on and off the Hugging Face Hub, you can install it via `pip install datasets`.
- `evaluate` - a library for evaluating machine learning model performance with various metrics, you can install it via `pip install evaluate`.
- `accelerate` - a library for training machine learning models faster, you can install it via `pip install accelerate`.
- `gradio` - a library for creating interactive demos of machine learning models, you can install it via `pip install gradio`.

We can also check the versions of our software with `package_name.__version__`.
# Install dependencies (this is mostly for Google Colab, as the other dependencies are available by default in Colab)
try:
    import datasets, evaluate, accelerate
    import gradio as gr
except ModuleNotFoundError:
    !pip install -U datasets evaluate accelerate gradio # -U stands for "upgrade" so we'll get the latest version by default
    import datasets, evaluate, accelerate
    import gradio as gr
import random
import numpy as np
import pandas as pd
import torch
import transformers
print(f"Using transformers version: {transformers.__version__}")
print(f"Using datasets version: {datasets.__version__}")
print(f"Using torch version: {torch.__version__}")
Using transformers version: 4.43.2
Using datasets version: 2.20.0
Using torch version: 2.4.0+cu121
Wonderful, as long as your versions are the same as or higher than the versions above, you should be able to run the code below.
3 Getting a dataset
Okay, now we've got the required libraries, let's get a dataset.
Getting a dataset is one of the most important parts of a machine learning project.
The dataset you use often determines the type of model you use as well as the quality of the outputs of that model.
Meaning, if you have a high quality dataset, chances are, your future model could also have high quality outputs.
It also means if your dataset is of poor quality, your model will likely also have poor quality outputs.
For a text classification problem, your dataset will likely come in the form of text (e.g. a paragraph, sentence or phrase) and a label (e.g. what category the text belongs to).
In our case, our dataset comes in the form of a collection of synthetic image captions and their corresponding labels (food or not food).
This is a dataset I’ve created earlier to help us practice building a text classification model.
You can find it on Hugging Face under the name `mrdbourke/learn_hf_food_not_food_image_captions`.
You can see how the Food Not Food image caption dataset was created in the example Google Colab notebook.
A Large Language Model (LLM) was asked to generate various image caption texts about food and not food.
Getting another model to create data for a problem is known as synthetic data generation and is a very good way of bootstrapping towards creating a model.
One workflow would be to use real data wherever possible and use synthetic data to boost when needed.
Note that it’s always advised to evaluate/test models on real-life data as opposed to synthetic data.
3.1 Where can you get more datasets?
There are many different places you can get datasets for text-based problems.
One of the best places is on the Hugging Face Hub, specifically huggingface.co/datasets.
Here you can find many different kinds of problem specific data such as text classification.
There are also many more datasets available on Kaggle Datasets.
And thanks to the power of LLMs (Large Language Models), you can also now create your own text classification datasets by generating samples (this is how I created the dataset for this project).
3.2 Loading the dataset
Once we've found/prepared a dataset on the Hugging Face Hub, we can use the Hugging Face `datasets` library to load it.
To load a dataset we can use the `datasets.load_dataset(path=NAME_OR_PATH_OF_DATASET)` function and pass it the name/path of the dataset we want to load.
In our case, our dataset name is `mrdbourke/learn_hf_food_not_food_image_captions` (you can also change this for your own dataset).
And since our dataset is hosted on Hugging Face, when we run the following code for the first time, it will download it.
If your target dataset is quite large, this download may take a while.
However, once the dataset is downloaded, subsequent reloads will be much faster.
# Load the dataset from Hugging Face Hub
dataset = datasets.load_dataset(path="mrdbourke/learn_hf_food_not_food_image_captions")

# Inspect the dataset
dataset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 250
})
})
Dataset loaded!
Looks like our dataset has two features, `text` and `label`.
And 250 total rows (the number of examples in our dataset).
We can check the column names with `dataset.column_names`.
# What features are there?
dataset.column_names
{'train': ['text', 'label']}
Looks like our dataset comes with a `train` split already (the whole dataset).
We can access the `train` split with `dataset["train"]` (some datasets also come with built-in `"test"` splits too).
# Access the training split
dataset["train"]
Dataset({
features: ['text', 'label'],
num_rows: 250
})
How about we check out a single sample?
We can do so with indexing.
"train"][0] dataset[
{'text': 'Creamy cauliflower curry with garlic naan, featuring tender cauliflower in a rich sauce with cream and spices, served with garlic naan bread.',
'label': 'food'}
Nice! We get back a dictionary with the keys `text` and `label`.
The `text` key contains the text of the image caption and the `label` key contains the label (food or not food).
3.3 Inspect random examples from the dataset
At 250 total samples, our dataset isn’t too large.
So we could sit here and explore the samples one by one.
But whenever I interact with a new dataset, I like to view a bunch of random examples and get a feel of the data.
Doing so is in line with the data explorer's motto: visualize, visualize, visualize!
As a rule of thumb, I like to view at least 20-100 random examples when interacting with a new dataset.
Let’s write some code to view 5 random indexes of our data and their corresponding text and labels at a time.
import random

random_indexes = random.sample(range(len(dataset["train"])), 5)
random_samples = dataset["train"][random_indexes]

print(f"[INFO] Random samples from dataset:\n")
for item in zip(random_samples["text"], random_samples["label"]):
    print(f"Text: {item[0]} | Label: {item[1]}")
[INFO] Random samples from dataset:
Text: Set of spatulas kept in a holder | Label: not_food
Text: Mouthwatering paneer tikka masala, featuring juicy paneer in a rich tomato-based sauce, garnished with fresh coriander leaves. | Label: food
Text: Pair of reading glasses left open on a book | Label: not_food
Text: Set of board games stacked on a shelf | Label: not_food
Text: Two handfuls of bananas in a fruit bowl with grapes on the side, the fruit bowl is blue | Label: food
Beautiful! Looks like our data contains a mix of shorter and longer sentences (between 5 and 20 words) of texts about food and not food.
We can get the unique labels in our dataset with `dataset["train"].unique("label")`.
# Get unique label values
dataset["train"].unique("label")
['food', 'not_food']
If our dataset is small enough to fit into memory, we can count the number of each label with Python's `collections.Counter` (a class for counting objects in an iterable or mapping).
# Check number of each label
from collections import Counter

Counter(dataset["train"]["label"])
Counter({'food': 125, 'not_food': 125})
Excellent, looks like our dataset is well balanced with 125 samples of food and 125 samples of not food.
In a binary classification case, this is ideal.
If the classes were dramatically unbalanced (e.g. 90% food and 10% not food) we might have to consider collecting/creating more data.
But best to train a model and see how it goes before making any drastic dataset changes.
Because our dataset is small, we could also inspect it via a pandas DataFrame (however, this may not be possible for extremely large datasets).
# Turn our dataset into a DataFrame and get a random sample
food_not_food_df = pd.DataFrame(dataset["train"])
food_not_food_df.sample(7)
text | label | |
---|---|---|
142 | A slice of pizza with a generous amount of shr... | food |
6 | Pair of reading glasses left open on a book | not_food |
97 | Telescope positioned on a balcony | not_food |
60 | A close-up of a family playing a board game wi... | not_food |
112 | Rich and spicy lamb rogan josh with yogurt gar... | food |
181 | A steaming bowl of fiery chicken curry, infuse... | food |
197 | Pizza with a stuffed crust, oozing with cheese | food |
# Get the value counts of the label column
food_not_food_df["label"].value_counts()
label
food 125
not_food 125
Name: count, dtype: int64
4 Preparing data for text classification
We’ve got our data ready but there are a few steps we’ll need to take before we can model it.
The main two being:
- Tokenization - turning our text into a numerical representation (machines prefer numbers rather than words), for example, `{"a": 0, "b": 1, "c": 2...}`.
- Creating a train/test split - right now our data is in a training split only but we'll create a test set to evaluate our model's performance.
These don’t necessarily have to be in order either.
Before we get to them, let’s create a small mapping from our labels to numbers.
In the same way we need to tokenize our text into numerical representation, we also need to do the same for our labels.
4.1 Creating a mapping from labels to numbers
Our machine learning model will want to see all numbers (people do well with text, computers do well with numbers).
This goes for text as well as label input.
So let’s create a mapping from our labels to numbers.
Since we've only got a couple of labels (`"food"` and `"not_food"`), we can create a dictionary to map them to numbers. However, if you've got a fair few labels, you may want to make this mapping programmatically.
We can use these dictionaries later on for our model training as well as evaluation.
# Create mapping from id2label and label2id
id2label = {0: "not_food", 1: "food"}
label2id = {"not_food": 0, "food": 1}

print(f"Label to ID mapping: {label2id}")
print(f"ID to Label mapping: {id2label}")
Label to ID mapping: {'not_food': 0, 'food': 1}
ID to Label mapping: {0: 'not_food', 1: 'food'}
In a binary classification task (such as what we're working on), the positive class, in our case `"food"`, is usually given the label `1` and the negative class (`"not_food"`) is given the label `0`.
Rather than hard-coding our label to ID maps, we can also create them programmatically from the dataset (this is helpful if you have many classes).
# Create mappings programmatically from dataset
id2label = {idx: label for idx, label in enumerate(dataset["train"].unique("label")[::-1])} # reverse the list to have "not_food" first
label2id = {label: idx for idx, label in id2label.items()}

print(f"Label to ID mapping: {label2id}")
print(f"ID to Label mapping: {id2label}")
Label to ID mapping: {'not_food': 0, 'food': 1}
ID to Label mapping: {0: 'not_food', 1: 'food'}
With our dictionary mappings created, we can update the labels of our dataset to be numeric.
We can do this using the `datasets.Dataset.map` method and passing it a function to apply to each example.
Let’s create a small function which turns an example label into a number.
# Turn labels into 0 or 1 (e.g. 0 for "not_food", 1 for "food")
def map_labels_to_number(example):
    example["label"] = label2id[example["label"]]
    return example

example_sample = {"text": "This is a sentence about my favourite food: honey.", "label": "food"}

# Test the function
map_labels_to_number(example_sample)
{'text': 'This is a sentence about my favourite food: honey.', 'label': 1}
Looks like our function works!
How about we map it to the whole dataset?
# Map our dataset labels to numbers
dataset = dataset["train"].map(map_labels_to_number)
dataset[:5]
{'text': ['Creamy cauliflower curry with garlic naan, featuring tender cauliflower in a rich sauce with cream and spices, served with garlic naan bread.',
'Set of books stacked on a desk',
'Watching TV together, a family has their dog stretched out on the floor',
'Wooden dresser with a mirror reflecting the room',
'Lawn mower stored in a shed'],
'label': [1, 0, 0, 0, 0]}
Nice! Looks like our labels are all numerical now.
We can check a few random samples using `dataset.shuffle()` and indexing for the first few.
# Shuffle the dataset and view the first 5 samples (will return different results each time)
dataset.shuffle()[:5]
{'text': ['Set of oven mitts hanging on a hook',
'Set of cookie cutters collected in a jar',
'Pizza with a dessert twist, featuring a sweet Nutella base and fresh strawberries on top',
'Set of binoculars placed on a table',
'Two handfuls of bananas in a fruit bowl with grapes on the side, the fruit bowl is blue'],
'label': [0, 0, 1, 0, 1]}
4.2 Split the dataset into training and test sets
Right now our dataset only has a training split.
However, we’d like to create a test split so we can evaluate our model.
In essence, our model will learn patterns (the relationship between text captions and their labels of food/not_food) on the training data.
And we will evaluate those learned patterns on the test data.
We can split our data using the `datasets.Dataset.train_test_split` method.
We can use the `test_size` parameter to define the percentage of data we'd like to use in our test set (e.g. `test_size=0.2` would mean 20% of the data goes to the test set).
# Create train/test splits
dataset = dataset.train_test_split(test_size=0.2, seed=42) # note: seed isn't needed, just here for reproducibility, without it you will get different splits each time you run the cell
dataset
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 200
})
test: Dataset({
features: ['text', 'label'],
num_rows: 50
})
})
Perfect!
Our dataset has been split into 200 training examples and 50 testing examples.
Let’s visualize a few random examples to make sure they still look okay.
= random.randint(0, len(dataset["train"]))
random_idx_train = dataset["train"][random_idx_train]
random_sample_train
= random.randint(0, len(dataset["test"]))
random_idx_test = dataset["test"][random_idx_test]
random_sample_test
print(f"[INFO] Random sample from training dataset:")
print(f"Text: {random_sample_train['text']}\nLabel: {random_sample_train['label']} ({id2label[random_sample_train['label']]})\n")
print(f"[INFO] Random sample from testing dataset:")
print(f"Text: {random_sample_test['text']}\nLabel: {random_sample_test['label']} ({id2label[random_sample_test['label']]})")
[INFO] Random sample from training dataset:
Text: Set of dumbbells stacked in a gym
Label: 0 (not_food)
[INFO] Random sample from testing dataset:
Text: Two handfuls of bananas in a fruit bowl with grapes on the side, the fruit bowl is blue
Label: 1 (food)
4.3 Tokenizing text data
Labels numericalized, dataset split, time to turn our text into numbers.
How?
Tokenization.
What’s tokenization?
Tokenization is the process of converting a non-numerical data source into numbers.
Why?
Because machines (especially machine learning models) prefer numbers to human-style data.
In the case of the text `"I love pizza"`, a very simple method of tokenization might be to convert each word to a number.
For example, `{"I": 0, "love": 1, "pizza": 2}`.
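To make the idea concrete, here's a rough sketch of such a whole-word tokenizer (not how we'll actually tokenize later, just an illustration):

```python
# A naive whole-word tokenizer (illustrative only)
text = "I love pizza"
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]

print(vocab)      # {'I': 0, 'love': 1, 'pizza': 2}
print(token_ids)  # [0, 1, 2]
```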
However, for most modern machine learning models, the tokenization process is a bit more nuanced.
For example, the text `"I love pizza"` might be tokenized into something more like `[101, 1045, 2293, 10733, 102]`.
Depending on the model you use, the tokenization process could be different.
For example, one model might turn `"I love pizza"` into `[40, 3021, 23317]`, whereas another model might turn it into `[101, 1045, 2293, 10733, 102]`.
To deal with this, Hugging Face often pairs models and tokenizers together by name.
Such is the case with `distilbert/distilbert-base-uncased` (there is a `tokenizer.json` file as well as a `tokenizer_config.json` file which contain all of the tokenizer implementation details).
For more examples of tokenization, you can see OpenAI's tokenization visualizer tool as well as their open-source library `tiktoken`. Google also has an open-source tokenization library called `sentencepiece`. Finally, Hugging Face's `tokenizers` library is a great resource too (this is what we'll be using behind the scenes).
Many of the text-based models on Hugging Face come paired with their own tokenizer.
For example, the `distilbert/distilbert-base-uncased` model is paired with the `distilbert/distilbert-base-uncased` tokenizer.
We can load the tokenizer for a given model using the `transformers.AutoTokenizer.from_pretrained` method and passing it the name of the model we'd like to use.
The `transformers.AutoTokenizer` class is part of a series of Auto Classes (such as `AutoConfig`, `AutoModel`, `AutoProcessor`) which automatically loads the correct configuration settings for a given model ID.
Let's load the tokenizer for the `distilbert/distilbert-base-uncased` model and see how it works.
Why use the `distilbert/distilbert-base-uncased` model?
The short answer is that I've used it before and it works well (and fast) on various text classification tasks.
It also performed well in the original research paper which introduced it.
The longer answer is that Hugging Face has many available open-source models for many different problems available at https://huggingface.co/models.
Navigating these models can take some practice.
And several models may be suited for the same task (though with various tradeoffs such as size and speed).
However, over time and with adequate experimentation, you'll start to build an intuition on which models are good for which problems.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="distilbert/distilbert-base-uncased",
                                          use_fast=True) # uses fast tokenization (backed by the tokenizers library and implemented in Rust) by default, if not available will default to the Python implementation

tokenizer
DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
Nice!
There’s our tokenizer!
It's an instance of the `transformers.DistilBertTokenizerFast` class.
You can read more about it in the documentation.
For now, let’s try it out by passing it a string of text.
# Test out the tokenizer
tokenizer("I love pizza")
{'input_ids': [101, 1045, 2293, 10733, 102], 'attention_mask': [1, 1, 1, 1, 1]}
# Try adding a "!" at the end
tokenizer("I love pizza!")
{'input_ids': [101, 1045, 2293, 10733, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}
Woohoo!
Our text gets turned into numbers (or tokens).
Notice how with even a slight change in the text, the tokenizer produces different results?
The `input_ids` are our tokens.
And the `attention_mask` (in our case, all `[1, 1, 1, 1, 1, 1]`) is a mask which tells the model which tokens to use or not.
Tokens with a mask value of `1` get used and tokens with a mask value of `0` get ignored.
There are several attributes of the `tokenizer` we can explore:

- `tokenizer.vocab` will return the vocabulary of the tokenizer, or in other words, the unique words/word pieces the tokenizer is capable of converting into numbers.
- `tokenizer.model_max_length` will return the maximum length of a sequence the tokenizer can process; pass anything longer than this and the sequence will be truncated.
# Get the length of the vocabulary
length_of_tokenizer_vocab = len(tokenizer.vocab)
print(f"Length of tokenizer vocabulary: {length_of_tokenizer_vocab}")

# Get the maximum sequence length the tokenizer can handle
max_tokenizer_input_sequence_length = tokenizer.model_max_length
print(f"Max tokenizer input sequence length: {max_tokenizer_input_sequence_length}")
Length of tokenizer vocabulary: 30522
Max tokenizer input sequence length: 512
Woah, looks like our tokenizer has a vocabulary of `30,522` different words and word pieces.
And it can handle a sequence length of up to `512` tokens (any sequence longer than this will be automatically truncated from the end).
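As a quick sanity check (a small sketch using a made-up repeated string), we can confirm that anything longer than 512 tokens gets capped when truncation is turned on:

```python
# A deliberately long input gets truncated to the model's max length (512 tokens) when truncation=True
long_text = "pizza " * 1000
tokenized_long_text = tokenizer(long_text, truncation=True)
print(len(tokenized_long_text.input_ids))  # 512
```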
Let’s check out some of the vocab.
Can I find my own name?
# Does "daniel" occur in the vocab?
"daniel"] tokenizer.vocab[
3817
Oooh, looks like my name is `3817` in the tokenizer's vocab.
Can you find your own name? (note: there may be an error if the token doesn't exist, we'll get to this)
How about “pizza”?
"pizza"] tokenizer.vocab[
10733
What if a word doesn’t exist in the vocab?
"akash"] tokenizer.vocab[
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[26], line 1
----> 1 tokenizer.vocab["akash"]

KeyError: 'akash'
Damn, we get a `KeyError`.
Not to worry, this is okay, since when calling the `tokenizer` on the word, it will automatically split the word into word pieces or subwords.
"akash") tokenizer(
{'input_ids': [101, 9875, 4095, 102], 'attention_mask': [1, 1, 1, 1]}
It works!
We can check what word pieces `"akash"` got broken into with `tokenizer.convert_ids_to_tokens(input_ids)`.
"akash").input_ids) tokenizer.convert_ids_to_tokens(tokenizer(
['[CLS]', 'aka', '##sh', '[SEP]']
Ahhh, it seems `"akash"` was split into two tokens, `["aka", "##sh"]`.
The `"##"` at the start of `"##sh"` means that the token is part of a larger word.
And the `"[CLS]"` and `"[SEP]"` tokens are special tokens indicating the start and end of a sequence.
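We can also go the other way and turn token IDs back into a string with `tokenizer.decode` (a quick aside, the special tokens show up unless we skip them):

```python
# Decode token IDs back into text (special tokens are included by default)
print(tokenizer.decode(tokenizer("akash").input_ids))                            # [CLS] akash [SEP]
print(tokenizer.decode(tokenizer("akash").input_ids, skip_special_tokens=True))  # akash
```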
Now, since tokenizers can deal with any text, what if there was an unknown token?
For example, rather than `"pizza"`, someone used the pizza emoji 🍕?
Let’s try!
# Try to tokenize an emoji
tokenizer.convert_ids_to_tokens(tokenizer("🍕").input_ids)
['[CLS]', '[UNK]', '[SEP]']
Ahh, we get the special `"[UNK]"` token.
This stands for "unknown".
The combination of word pieces and the `"[UNK]"` special token means that our `tokenizer` will be able to turn almost any text into numbers for our model.
Keep in mind that just because one tokenizer uses an unknown special token for a particular word or emoji (🍕) doesn’t mean another will.
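For example (a small sketch that downloads a second tokenizer, so feel free to skip running it), GPT-2's byte-level BPE tokenizer can encode the pizza emoji without falling back to an unknown token:

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE, so it can represent arbitrary bytes (including emoji) rather than using an [UNK]-style token
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.convert_ids_to_tokens(gpt2_tokenizer("🍕").input_ids))
```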
Since the `tokenizer.vocab` is a Python dictionary, we can get a sample of the vocabulary using `tokenizer.vocab.items()`.
How about we get the first 5?
# Get the first 5 items in the tokenizer vocab
sorted(tokenizer.vocab.items())[:5]
[('!', 999), ('"', 1000), ('#', 1001), ('##!', 29612), ('##"', 29613)]
There's our `'!'` from before! Looks like the first five items are all related to punctuation.
How about a random sample of tokens?
import random

random.sample(sorted(tokenizer.vocab.items()), k=5)
[('##vies', 25929),
('responsibility', 5368),
('##pm', 9737),
('persona', 16115),
('rhythm', 6348)]
4.4 Making a preprocessing function to tokenize text
Rather than tokenizing our texts one by one, it’s best practice to define a preprocessing function which does it for us.
This process works regardless of whether you’re working with text data or other kinds of data such as images or audio.
For any kind of machine learning workflow, an important first step is turning your input data into numbers.
As machine learning models are algorithms which find patterns in numbers, before they can find patterns in your data (text, images, audio, tables) it must be numerically encoded first (e.g. tokenizing text).
To help with this, `transformers` has an `AutoProcessor` class which can preprocess data in the specific format required for a paired model.
To prepare our text data, let's create a preprocessing function which takes in a dictionary containing the key `"text"` with the value of a target string (our data samples come in the form of dictionaries) and then returns the tokenized `"text"`.
We'll set the following parameters in our `tokenizer`:

- `padding=True` - This will make all the sequences in a batch the same length by padding shorter sequences with 0's until they equal the longest size in the batch. Why? If there are different size sequences in a batch, you can sometimes run into dimensionality issues.
- `truncation=True` - This will shorten sequences longer than the model can handle to the model's max input size (e.g. if a sequence is 1000 tokens long and the model can handle 512, it will be shortened to 512 by removing all tokens after 512).
You can see more parameters available for the `tokenizer` in the `transformers.PreTrainedTokenizer` documentation.
For more on padding and truncation (two important concepts in sequence processing), I'd recommend reading the Hugging Face documentation on Padding and Truncation.
def tokenize_text(examples):
    """
    Tokenize given example text and return the tokenized text.
    """
    return tokenizer(examples["text"],
                     padding=True, # pad short sequences to the longest sequence in the batch
                     truncation=True) # truncate long sequences to the maximum length the model can handle
Wonderful!
Now let’s try it out on an example sample.
= {"text": "I love pizza", "label": 1}
example_sample_2
# Test the function
tokenize_text(example_sample_2)
{'input_ids': [101, 1045, 2293, 10733, 102], 'attention_mask': [1, 1, 1, 1, 1]}
Looking good!
How about we map our `tokenize_text` function to our whole `dataset`?
We can do so with the `datasets.Dataset.map` method.
The `map` method allows us to apply a given function to all examples in a dataset.
By setting `batched=True` we can apply the given function to batches of examples (many at a time) to speed up computation time.
Let's create a `tokenized_dataset` object by calling `map` on our `dataset` and passing it our `tokenize_text` function.
# Map our tokenize_text function to the dataset
tokenized_dataset = dataset.map(function=tokenize_text,
                                batched=True, # set batched=True to operate across batches of examples rather than only single examples
                                batch_size=1000) # defaults to 1000, can be increased if you have a large dataset

tokenized_dataset
DatasetDict({
train: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 200
})
test: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 50
})
})
Dataset tokenized!
Let’s inspect a pair of samples.
# Get two samples from the tokenized dataset
train_tokenized_sample = tokenized_dataset["train"][0]
test_tokenized_sample = tokenized_dataset["test"][0]

for key in train_tokenized_sample.keys():
    print(f"[INFO] Key: {key}")
    print(f"Train sample: {train_tokenized_sample[key]}")
    print(f"Test sample: {test_tokenized_sample[key]}")
    print("")
[INFO] Key: text
Train sample: Set of headphones placed on a desk
Test sample: A slice of pepperoni pizza with a layer of melted cheese
[INFO] Key: label
Train sample: 0
Test sample: 1
[INFO] Key: input_ids
Train sample: [101, 2275, 1997, 2132, 19093, 2872, 2006, 1037, 4624, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Test sample: [101, 1037, 14704, 1997, 11565, 10698, 10733, 2007, 1037, 6741, 1997, 12501, 8808, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[INFO] Key: attention_mask
Train sample: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Test sample: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Beautiful! Our samples have been tokenized.
Notice the zeroes on the end of the `input_ids` and `attention_mask` values.
These are padding tokens to ensure that each sample has the same length as the longest sequence in a given batch.
We can now use these tokenized samples later on in our model.
4.5 Tokenization takeaways
We’ve now seen and used tokenizers in practice.
A few takeaways before we start to build a model:
- Tokenizers are used to turn text (or other forms of data such as images and audio) into a numerical representation ready to be used with a machine learning model.
- Many models reuse existing tokenizers and many models have their own specific tokenizer paired with them. Hugging Face's `transformers.AutoTokenizer`, `transformers.AutoProcessor` and `transformers.AutoModel` classes make it easy to pair tokenizers and models based on their name (e.g. `distilbert/distilbert-base-uncased`).
5 Setting up an evaluation metric
Aside from training a model, one of the most important steps in machine learning is evaluating a model.
To do so, we can use evaluation metrics.
An evaluation metric attempts to represent a model's performance in a single number (or series of numbers) (note, I say "attempts" here because evaluation metrics are useful to gauge performance but the real test of a machine learning model is in the real world).
There are many different kinds of evaluation metrics for various problems.
But since we’re focused on text classification, we’ll use accuracy as our evaluation metric.
A model which gets 99/100 predictions correct has an accuracy of 99%.
\[ \text{Accuracy} = \frac{\text{correct classifications}}{\text{all classifications}} \]
For some projects, you may have a minimum standard of a metric.
For example, when I worked on an insurance claim classification model, the clients required over 98% accuracy on the test dataset for it to be viable to use in production.
If needed, we can craft these evaluation metrics ourselves.
However, Hugging Face has a library called `evaluate` which has various metrics built in, ready to use.
We can load a metric using `evaluate.load("METRIC_NAME")`.
Let's load in `"accuracy"` and build a function to measure accuracy by comparing arrays of predictions and labels.
import evaluate
import numpy as np
from typing import Tuple

accuracy_metric = evaluate.load("accuracy")

def compute_accuracy(predictions_and_labels: Tuple[np.array, np.array]):
    """
    Computes the accuracy of a model by comparing the predictions and labels.
    """
    predictions, labels = predictions_and_labels

    # Get the highest probability class of each prediction if predictions are probabilities
    if len(predictions.shape) >= 2:
        predictions = np.argmax(predictions, axis=1)

    return accuracy_metric.compute(predictions=predictions, references=labels)
Accuracy function created!
Now let’s test it out.
# Create example list of predictions and labels
example_predictions_all_correct = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
example_predictions_one_wrong = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
example_labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Test the function
print(f"Accuracy when all predictions are correct: {compute_accuracy((example_predictions_all_correct, example_labels))}")
print(f"Accuracy when one prediction is wrong: {compute_accuracy((example_predictions_one_wrong, example_labels))}")
Accuracy when all predictions are correct: {'accuracy': 1.0}
Accuracy when one prediction is wrong: {'accuracy': 0.9}
Excellent, our function works just as we’d like.
When all predictions are correct, it scores 1.0 (or 100% accuracy) and when 9/10 predictions are correct, it returns 0.9 (or 90% accuracy).
We can use this function during training and evaluation of our model.
6 Setting up a model for training
We’ve gone through the important steps of setting data up for training (and evaluation).
Now let’s prepare a model.
We’ll keep going through the following steps:
1. ✅ Create and preprocess data.
2. Define the model we'd like to use with `transformers.AutoModelForSequenceClassification` (or another similar model class).
3. Define training arguments (these are hyperparameters for our model) with `transformers.TrainingArguments`.
4. Pass the `TrainingArguments` from 3 and target datasets to an instance of `transformers.Trainer`.
5. Train the model by calling `Trainer.train()`.
6. Save the model (to our local machine or to the Hugging Face Hub).
7. Evaluate the trained model by making and inspecting predictions on the test data.
8. Turn the model into a shareable demo.
Let’s start by creating an instance of a model.
Since we're working on text classification, we'll do so with `transformers.AutoModelForSequenceClassification` (where sequence classification means classifying a sequence of something, e.g. our sequences of text).
We can use the `from_pretrained()` method to instantiate a pretrained model from the Hugging Face Hub.
The "pretrained" in `transformers.AutoModelForSequenceClassification.from_pretrained` means acquiring a model which has already been trained on a certain dataset.
This is common practice in many machine learning projects and is known as transfer learning.
The idea is to take an existing model which works well on a task similar to your target task and then fine-tune it to work even better on your target task.
In our case, we're going to use the pretrained DistilBERT base model (`distilbert/distilbert-base-uncased`) which has been trained on many thousands of books as well as a version of the English Wikipedia (millions of words).
This training gives it a very good baseline representation of the patterns in language.
We’ll take this baseline representation of the patterns in language and adjust it slightly to focus specifically on predicting whether an image caption is about food or not (based on the words it contains).
The main two benefits of using transfer learning are:
- Ability to get good results with smaller amounts of data (since the main representations are learned on a larger dataset, we only have to show the model a few examples of our specific problem).
- This process can be repeated across various domains and tasks. For example, you can take a computer vision model trained on millions of images and customize it to your own use case. Or an audio model trained on many different nature sounds and customize it specifically for birds.
So when starting a new machine learning project, one of the first questions you should ask is: does an existing pretrained model similar to my task exist and can I fine-tune it for my own task?
For an end-to-end example of transfer learning in PyTorch (another popular deep learning framework), see PyTorch Transfer Learning.
Time to set up our `model` instance.
A few things to note:

- We'll use `transformers.AutoModelForSequenceClassification.from_pretrained`, this will create the model architecture we specify with the `pretrained_model_name_or_path` parameter.
- The `AutoModelForSequenceClassification` class comes with a classification head on top of our model (so we can customize this to the number of classes we have with the `num_labels` parameter).
- Using `from_pretrained` will also call the `transformers.PretrainedConfig` class which will enable us to set the `id2label` and `label2id` parameters for our fine-tuning task.
Let's refresh what our `id2label` and `label2id` objects look like.
# Get id and label mappings
print(f"id2label: {id2label}")
print(f"label2id: {label2id}")
id2label: {0: 'not_food', 1: 'food'}
label2id: {'not_food': 0, 'food': 1}
Beautiful, we can pass these mappings to `transformers.AutoModelForSequenceClassification.from_pretrained`.
from transformers import AutoModelForSequenceClassification

# Setup model for fine-tuning with classification head (top layers of network)
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="distilbert/distilbert-base-uncased",
    num_labels=2, # can customize this to the number of classes in your dataset
    id2label=id2label, # mappings from class IDs to the class labels (for classification tasks)
    label2id=label2id
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model created!
You’ll notice that a warning message gets displayed:
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: [‘classifier.bias’, ‘classifier.weight’, ‘pre_classifier.bias’, ‘pre_classifier.weight’] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
This is essentially saying “hey, some of the layers in this model are newly initialized (with random patterns) and you should probably customize them to your own dataset”.
This happens because we used the `AutoModelForSequenceClassification` class.
Whilst the majority of the layers in our model have already learned patterns from a large corpus of text, the top layers (the classifier layers) have been randomly initialized so we can customize them to our own dataset.
Let’s try and make a prediction with our model and see what happens.
# Try and make a prediction with the loaded model (this will error)
model(**tokenized_dataset["train"][0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[40], line 2
      1 # Try and make a prediction with the loaded model (this will error)
----> 2 model(**tokenized_dataset["train"][0])

File ~/miniconda3/envs/learn_hf/lib/python3.11/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/learn_hf/lib/python3.11/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

TypeError: DistilBertForSequenceClassification.forward() got an unexpected keyword argument 'text'
Oh no! We get an error.
Not to worry, this is only because our sample still contains the raw `text` (and `label`) fields, and the model's `forward()` method only accepts tokenized inputs such as `input_ids` and `attention_mask`. The `transformers.Trainer` we set up later will handle formatting our samples for us.
Let's take a look at the layers in our model.
# Inspect the model
model
DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
)
You’ll notice that the model comes in 3 main parts (data flows through these sequentially):
- `embeddings` - This part of the model turns the input tokens into a learned representation. So rather than just a list of integers, the values become a learned representation. This learned representation comes from the base model learning how different words and word pieces relate to each other thanks to its training data. The size of `(30522, 768)` means the `30,522` words in the vocabulary are all represented by vectors of size `768` (one word gets represented by 768 numbers, these are often not human interpretable).
- `transformer` - This is the main body of the model. There are several `TransformerBlock` layers stacked on top of each other. These layers attempt to learn a deeper representation of the data going through the model. A thorough breakdown of these layers is beyond the scope of this tutorial, however, for an in-depth guide on Transformer-based models, I'd recommend reading Transformers from scratch by Peter Bloem, going through Andrej Karpathy's lecture on Transformers and their history or reading the original Attention Is All You Need paper (this is the paper that introduced the Transformer architecture).
- `classifier` - This is what is going to take the representation of the data and compress it into our number of target classes (notice `out_features=2`, this means that we'll get two output numbers, one for each of our classes).
For more on the entire DistilBert architecture and its training setup, I’d recommend reading the DistilBert paper from the Hugging Face team.
Rather than break down the model itself, we're focused on using it for a particular task (classifying text).
6.1 Counting the parameters of our model
Before we move into training, we can get another insight into our model by counting its number of parameters.
Let’s create a small function to count the number of trainable (these will update during training) and total parameters in our model.
def count_params(model):
    """
    Count the parameters of a PyTorch model.
    """
    trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_parameters = sum(p.numel() for p in model.parameters())

    return {"trainable_parameters": trainable_parameters, "total_parameters": total_parameters}

# Count the parameters of the model
count_params(model)
{'trainable_parameters': 66955010, 'total_parameters': 66955010}
Nice!
Looks like our model has a total of 66,955,010 parameters and all of them are trainable.
A parameter is a numerical value in a model which is capable of being updated to better represent the input data.
I like to think of them as a small opportunity to learn patterns in the data.
If a model has three parameters, it has three small opportunities to learn patterns in the data.
Whereas, if a model has 60,000,000+ (60M) parameters (like our `model`), it has 60,000,000+ small opportunities to learn patterns in the data.
Some models such as Large Language Models (LLMs) like Llama 3 70B have 70,000,000,000+ (70B) parameters (over 1000x our model).
In essence, the more parameters a model has, the more opportunities it has to learn (generally).
More parameters often results in more capabilities.
However, more parameters also often results in a much larger model size (e.g. many gigabytes versus hundreds of megabytes) as well as a much longer compute time (fewer samples per second).
For our use case, a binary text classification task, 60M parameters is more than enough.
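As a rough back-of-the-envelope check (a sketch assuming float32 weights at 4 bytes per parameter), we can estimate how much memory those parameters take up:

```python
# Rough model size estimate, assuming float32 (4 bytes) per parameter
total_parameters = count_params(model)["total_parameters"]
bytes_per_parameter = 4
approx_size_mb = (total_parameters * bytes_per_parameter) / (1024 ** 2)
print(f"Approximate model size: {approx_size_mb:.0f} MB")  # ~255 MB for ~67M parameters
```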
Why count the parameters in a model?
While it may be tempting to always go with a model that has the most parameters, there are many considerations to take into account before doing so.
What hardware is the model going to run on?
If you need the model to run on cheap hardware, you’ll likely want a smaller model.
How fast do you need the model to be?
If you need 100-1000s of predictions per second, you’ll likely want a smaller model.
“I don’t mind about speed or cost, I just want quality.”
Go with the biggest model you can.
However, you can often get really good results by training a small model on quality data for a specific task, rather than always reaching for the largest model.
6.2 Create a directory for saving models
Training a model can take a while.
So we’ll want a place to save our models.
Let’s create a directory called "learn_hf_food_not_food_text_classifier-distilbert-base-uncased"
(it’s a bit verbose and you can change this if you like but I like to be specific).
# Create model output directory
from pathlib import Path

# Create models directory
models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Create model save name
model_save_name = "learn_hf_food_not_food_text_classifier-distilbert-base-uncased"

# Create model save path
model_save_dir = Path(models_dir, model_save_name)

model_save_dir
PosixPath('models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased')
6.3 Setting up training arguments with TrainingArguments
Time to get our model ready for training!
We’re up to step 3 of our process:
1. ✅ Create and preprocess data.
2. ✅ Define the model we'd like to use with `transformers.AutoModelForSequenceClassification` (or another similar model class).
3. Define training arguments (these are hyperparameters for our model) with `transformers.TrainingArguments`.
4. Pass the `TrainingArguments` from 3 and target datasets to an instance of `transformers.Trainer`.
5. Train the model by calling `Trainer.train()`.
6. Save the model (to our local machine or to the Hugging Face Hub).
7. Evaluate the trained model by making and inspecting predictions on the test data.
8. Turn the model into a shareable demo.
The transformers.TrainingArguments
class contains a series of helpful items, including hyperparameter settings and model saving strategies to use throughout training.
It has many parameters, too many to explain here.
However, the following table breaks down a helpful handful.
Some of the parameters we'll set are the same as the defaults (this is on purpose as the defaults are often pretty good), while some such as `learning_rate` are different.
Parameter | Explanation |
---|---|
`output_dir` | Name of the output directory to save the model and checkpoints to. For example, `learn_hf_food_not_food_text_classifier_model`. |
`learning_rate` | Value of the initial learning rate to use during training. Passed to `transformers.AdamW`. Initial learning rate because the learning rate can be dynamic during training. The ideal learning rate is experimental in nature. Defaults to `5e-5` (`0.00005`) but we'll use `0.0001`. |
`per_device_train_batch_size` | Size of batches to place on the target device during training. For example, a batch size of 32 means the model will look at 32 samples at a time. A batch size too large will result in out of memory issues (e.g. your GPU can't handle holding a large number of samples in memory at a time). |
`per_device_eval_batch_size` | Size of batches to place on the target device during evaluation. Can often be larger than during training because no gradients are being calculated. For example, the training batch size could be 32 whereas the evaluation batch size may be able to be 128 (4x larger). Though these are only estimates. |
`num_train_epochs` | Number of times to pass over the data to try and learn patterns. For example, if `num_train_epochs=10`, the model will do 10 full passes of the training data. Because we're working with a small dataset, 10 epochs should be fine to begin with. However, if you had a larger dataset, you may want to do a few experiments using less data (e.g. 10% of the data) for a smaller number of epochs to make sure things work. |
`eval_strategy` | When to evaluate the model on the evaluation data. If `eval_strategy="epoch"`, the model will be evaluated every epoch. See the documentation for more options. Note: This was previously called `evaluation_strategy` but was shortened in `transformers==4.46`. |
`save_strategy` | When to save a model checkpoint. If `save_strategy="epoch"`, a checkpoint will be saved every epoch. See the documentation for more save options. |
`save_total_limit` | Total number of checkpoints to save (so we don't save `num_train_epochs` checkpoints). For example, we can limit to 3 saves so the total number of saves are the 3 most recent as well as the best performing checkpoint (as per `load_best_model_at_end`). |
`use_cpu` | Set to `False` by default, will use a CUDA GPU (`torch.device("cuda")`) or MPS device (`torch.device("mps")`, for Mac) if available. This is because training is generally faster on an accelerator device. |
`seed` | Set to `42` by default for reproducibility. Meaning that subsequent runs with the same setup should achieve the same results. |
`load_best_model_at_end` | When set to `True`, makes sure that the best model found during training is loaded when training finishes. This will mean the best model checkpoint gets saved regardless of what epoch it happened on. This is set to `False` by default. |
`logging_strategy` | When to log the training results and metrics. For example, if `logging_strategy="epoch"`, results will be logged as outputs every epoch. See the documentation for more logging options. |
`report_to` | Log experiments to various experiment tracking services. For example, you can log to Weights & Biases using `report_to="wandb"`. We'll turn this off for now and keep logging to a local directory by setting `report_to="none"`. |
`push_to_hub` | Automatically upload the model to the Hugging Face Hub every time the model is saved. We'll set `push_to_hub=False` as we'll see how to do this manually later on. See the documentation for more options on saving models to the Hugging Face Hub. |
`hub_token` | Add your Hugging Face Hub token to push a model to the Hugging Face Hub with `push_to_hub` (will default to `huggingface-cli login` details). |
`hub_private_repo` | Whether or not to make the Hugging Face Hub repository private or public, defaults to `False` (e.g. set to `True` if you want the repository to be private). |
To get more familiar with the transformers.TrainingArguments
class, I’d highly recommend reading the documentation for 15-20 minutes. Perhaps over a couple of sessions. There are quite a large number of parameters which will be helpful to be aware of.
Phew!
That was a lot to take in.
But let’s now practice setting up our own instance of transformers.TrainingArguments
.
from transformers import TrainingArguments
print(f"[INFO] Saving model checkpoints to: {model_save_dir}")
# Create training arguments
training_args = TrainingArguments(
    output_dir=model_save_dir,
    learning_rate=0.0001,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    eval_strategy="epoch", # was previously "evaluation_strategy"
    save_strategy="epoch",
    save_total_limit=3, # limit the total amount of saved checkpoints (so we don't save num_epochs checkpoints)
    use_cpu=False, # set to False by default, will use CUDA GPU or MPS device if available
    seed=42, # set to 42 by default for reproducibility
    load_best_model_at_end=True, # load the best model when finished training
    logging_strategy="epoch", # log training results every epoch
    report_to="none", # optional: log experiments to Weights & Biases/other similar experiment tracking services (we'll turn this off for now)
    # push_to_hub=True, # optional: automatically upload the model to the Hub (we'll do this manually later on)
    # hub_token="your_token_here", # optional: add your Hugging Face Hub token to push to the Hub (will default to huggingface-cli login)
    hub_private_repo=False # optional: make the uploaded model private (defaults to False)
)
# Optional: Print out training_args to inspect (warning, it is quite a long output)
# training_args
[INFO] Saving model checkpoints to: models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased
Training arguments created!
Let’s put them to work in an instance of transformers.Trainer
.
6.4 Setting up an instance of Trainer
Time for step 4!
- ✅ Create and preprocess data.
- ✅ Define the model we’d like to use with transformers.AutoModelForSequenceClassification (or another similar model class).
- ✅ Define training arguments (these are hyperparameters for our model) with transformers.TrainingArguments.
- Pass TrainingArguments from 3 and target datasets to an instance of transformers.Trainer.
- Train the model by calling Trainer.train().
- Save the model (to our local machine or to the Hugging Face Hub).
- Evaluate the trained model by making and inspecting predictions on the test data.
- Turn the model into a shareable demo.
The transformers.Trainer
class allows you to train models.
It’s built on PyTorch so it gets to leverage all of the powerful PyTorch toolkit.
But since it also works closely with the transformers.TrainingArguments
class, it offers many helpful features.
transformers.Trainer
can work with torch.nn.Module
models, however, it is designed to work best with transformers.PreTrainedModel
’s from the transformers
library.
This is not a problem for us as we’re using transformers.AutoModelForSequenceClassification.from_pretrained
which loads a transformers.PreTrainedModel
.
See the transformers.Trainer
documentation for tips on how to make sure your model is compatible.
Parameter | Explanation |
---|---|
model |
The model we’d like to train. Works best with an instance of transformers.PreTrainedModel . Most models loaded using from_pretrained will be of this type. |
args |
Instance of transformers.TrainingArguments . We’ll use the training_args object we defined earlier. But if this is not set, it will default to the default settings for transformers.TrainingArguments . |
train_dataset |
Dataset to use during training. We can use our tokenized_dataset["train"] as it has already been preprocessed. |
eval_dataset |
Dataset to use during evaluation (our model will not see this data during training). We can use our tokenized_dataset["test"] as it has already been preprocessed. |
tokenizer |
The tokenizer which was used to preprocess the data. Passing a tokenizer will also pad the inputs to maximum length when batching them. It will also be saved with the model so future re-runs are easier. |
compute_metrics |
An evaluation function to evaluate a model during training and evaluation steps. In our case, we’ll use the compute_accuracy function we defined earlier. |
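A quick aside on the tokenizer parameter: when you pass a tokenizer and no data_collator, the Trainer pads each batch dynamically for you. If you wanted to set this up explicitly, a minimal sketch (assuming the tokenizer variable we created earlier) might look like the following; passing tokenizer=tokenizer as we do below achieves the same effect.

from transformers import DataCollatorWithPadding

# Dynamically pad each batch to the length of the longest sample in that batch
# (rather than padding everything to one fixed maximum length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# This could then be passed to the Trainer via its data_collator parameter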
With all this being said, let’s build our Trainer
!
from transformers import Trainer
# Setup Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer, # Pass tokenizer to the Trainer for dynamic padding (padding as the training happens) (see "data_collator" in the Trainer docs)
    compute_metrics=compute_accuracy
)
Woohoo! We’ve created our own trainer
.
We’re one step closer to training!
6.5 Training our text classification model
We’ve done most of the hard work setting up our transformers.TrainingArguments
as well as our transformers.Trainer
.
Now how about we train a model?
Following our steps:
- ✅ Create and preprocess data.
- ✅ Define the model we’d like to use with transformers.AutoModelForSequenceClassification (or another similar model class).
- ✅ Define training arguments (these are hyperparameters for our model) with transformers.TrainingArguments.
- ✅ Pass TrainingArguments from 3 and target datasets to an instance of transformers.Trainer.
- Train the model by calling Trainer.train().
- Save the model (to our local machine or to the Hugging Face Hub).
- Evaluate the trained model by making and inspecting predictions on the test data.
- Turn the model into a shareable demo.
Looks like all we have to do is call transformers.Trainer.train()
.
We’ll be sure to save the results of the training to a variable results
so we can inspect them later.
Let’s try!
# Train a text classification model
results = trainer.train()
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 0.328100 | 0.039627 | 1.000000 |
2 | 0.019200 | 0.005586 | 1.000000 |
3 | 0.003700 | 0.002026 | 1.000000 |
4 | 0.001700 | 0.001186 | 1.000000 |
5 | 0.001100 | 0.000858 | 1.000000 |
6 | 0.000800 | 0.000704 | 1.000000 |
7 | 0.000800 | 0.000619 | 1.000000 |
8 | 0.000700 | 0.000571 | 1.000000 |
9 | 0.000600 | 0.000547 | 1.000000 |
10 | 0.000600 | 0.000539 | 1.000000 |
Woahhhh!!!
How cool is that!
We just trained a text classification model!
And it looks like the training went pretty quickly (thanks to our smaller dataset and relatively small model; for larger datasets, training would likely take longer).
How about we check some of the metrics?
We can do so using the results.metrics
attribute (this returns a Python dictionary with stats from our training run).
# Inspect training metrics
for key, value in results.metrics.items():
    print(f"{key}: {value}")
train_runtime: 7.5421
train_samples_per_second: 265.177
train_steps_per_second: 9.281
total_flos: 18110777160000.0
train_loss: 0.03574410408868321
epoch: 10.0
Nice!
Looks like our overall training runtime is low because of our small dataset.
And looks like our trainer
was able to process a fair few samples per second.
If we were to 1000x the size of our dataset (e.g. ~250 samples -> ~250,000 samples which is quite a substantial dataset), it seems our training time still wouldn’t take too long.
The total_flos
stands for “floating point operations” (also referred to as FLOPS), this is the total number of calculations our model has performed to find patterns in the data. And as you can see, it’s quite a large number!
Depending on the hardware you’re using, the results with respect to train_runtime
, train_samples_per_second
and train_steps_per_second
will likely be different.
The faster your accelerator hardware (e.g. NVIDIA GPU or Mac GPU), the lower your runtime and higher your samples/steps per second will be.
For reference, on my local NVIDIA RTX 4090, I get a train_runtime
of 8-9 seconds, train_samples_per_second
of 230-250 and train_steps_per_second
of 8.565.
6.6 Save the model for later use
Now our model has been trained, let’s save it for later use.
We’ll save it locally first and push it to the Hugging Face Hub later.
We can save our model using the transformers.Trainer.save_model
method.
# Save model
print(f"[INFO] Saving model to {model_save_dir}")
trainer.save_model(output_dir=model_save_dir)
[INFO] Saving model to models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased
Model saved locally! Before we save it to the Hugging Face Hub, let’s check out its metrics.
6.7 Inspecting the model training metrics
We can get a log of our model’s training state using trainer.state.log_history
.
This will give us a collection of metrics per epoch (as long as we set logging_strategy="epoch"
in transformers.TrainingArguments
), in particular, it will give us a loss value per epoch.
We can extract these values and inspect them visually for a better understanding of our model training.
Let’s get the training history and inspect it.
# Get training history
trainer_history_all = trainer.state.log_history
trainer_history_metrics = trainer_history_all[:-1] # get everything except the training time metrics (we've seen these already)
trainer_history_training_time = trainer_history_all[-1] # this is the same value as results.metrics from above

# View the first 4 metrics from the training history
trainer_history_metrics[:4]
[{'loss': 0.3281,
'grad_norm': 0.6938912272453308,
'learning_rate': 9e-05,
'epoch': 1.0,
'step': 7},
{'eval_loss': 0.03962664306163788,
'eval_accuracy': 1.0,
'eval_runtime': 0.0135,
'eval_samples_per_second': 3707.312,
'eval_steps_per_second': 148.292,
'epoch': 1.0,
'step': 7},
{'loss': 0.0192,
'grad_norm': 0.14873287081718445,
'learning_rate': 8e-05,
'epoch': 2.0,
'step': 14},
{'eval_loss': 0.005585948005318642,
'eval_accuracy': 1.0,
'eval_runtime': 0.0147,
'eval_samples_per_second': 3399.06,
'eval_steps_per_second': 135.962,
'epoch': 2.0,
'step': 14}]
Okay, looks like the metrics are logged every epoch in a list of Python dictionaries with interleaving loss
(this is the training set loss) and eval_loss
values.
How about we write some code to separate the training set metrics and the evaluation set metrics?
import pprint # import pretty print for nice printing of lists
# Extract training and evaluation metrics
trainer_history_training_set = []
trainer_history_eval_set = []

# Loop through metrics and filter for training and eval metrics
for item in trainer_history_metrics:
    item_keys = list(item.keys())
    # Check to see if "eval" is in the keys of the item
    if any("eval" in item for item in item_keys):
        trainer_history_eval_set.append(item)
    else:
        trainer_history_training_set.append(item)

# Show the first two items in each metric set
print(f"[INFO] First two items in training set:")
pprint.pprint(trainer_history_training_set[:2])

print(f"\n[INFO] First two items in evaluation set:")
pprint.pprint(trainer_history_eval_set[:2])
[INFO] First two items in training set:
[{'epoch': 1.0,
'grad_norm': 0.6938912272453308,
'learning_rate': 9e-05,
'loss': 0.3281,
'step': 7},
{'epoch': 2.0,
'grad_norm': 0.14873287081718445,
'learning_rate': 8e-05,
'loss': 0.0192,
'step': 14}]
[INFO] First two items in evaluation set:
[{'epoch': 1.0,
'eval_accuracy': 1.0,
'eval_loss': 0.03962664306163788,
'eval_runtime': 0.0135,
'eval_samples_per_second': 3707.312,
'eval_steps_per_second': 148.292,
'step': 7},
{'epoch': 2.0,
'eval_accuracy': 1.0,
'eval_loss': 0.005585948005318642,
'eval_runtime': 0.0147,
'eval_samples_per_second': 3399.06,
'eval_steps_per_second': 135.962,
'step': 14}]
Beautiful!
How about we take it a step further and turn our metrics into pandas DataFrames so we can view them easier?
# Create pandas DataFrames for the training and evaluation metrics
trainer_history_training_df = pd.DataFrame(trainer_history_training_set)
trainer_history_eval_df = pd.DataFrame(trainer_history_eval_set)

trainer_history_training_df.head()
loss | grad_norm | learning_rate | epoch | step | |
---|---|---|---|---|---|
0 | 0.3281 | 0.693891 | 0.00009 | 1.0 | 7 |
1 | 0.0192 | 0.148733 | 0.00008 | 2.0 | 14 |
2 | 0.0037 | 0.037808 | 0.00007 | 3.0 | 21 |
3 | 0.0017 | 0.022227 | 0.00006 | 4.0 | 28 |
4 | 0.0011 | 0.018665 | 0.00005 | 5.0 | 35 |
Nice!
And the evaluation DataFrame?
trainer_history_eval_df.head()
eval_loss | eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch | step | |
---|---|---|---|---|---|---|---|
0 | 0.039627 | 1.0 | 0.0135 | 3707.312 | 148.292 | 1.0 | 7 |
1 | 0.005586 | 1.0 | 0.0147 | 3399.060 | 135.962 | 2.0 | 14 |
2 | 0.002026 | 1.0 | 0.0136 | 3680.635 | 147.225 | 3.0 | 21 |
3 | 0.001186 | 1.0 | 0.0151 | 3303.902 | 132.156 | 4.0 | 28 |
4 | 0.000858 | 1.0 | 0.0159 | 3146.137 | 125.845 | 5.0 | 35 |
And of course, we’ll have to follow the data explorer’s motto of visualize, visualize, visualize! and inspect our loss curves.
# Plot training and evaluation loss
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(trainer_history_training_df["epoch"], trainer_history_training_df["loss"], label="Training loss")
plt.plot(trainer_history_eval_df["epoch"], trainer_history_eval_df["eval_loss"], label="Evaluation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Text classification with DistilBert training and evaluation loss over time")
plt.legend()
plt.show()
B-e-a-utiful!
That is exactly what we wanted.
Training and evaluation loss going down over time.
6.8 Pushing our model to the Hugging Face Hub
We’ve saved our model locally and confirmed that it seems to be performing well on our training metrics but how about we push it to the Hugging Face Hub?
The Hugging Face Hub is one of the best sources of machine learning models on the internet.
And we can add our model there so others can use it or we can access it in the future (we could also keep it private on the Hugging Face Hub so only people from our organization can use it).
Sharing models on Hugging Face is also a great way to showcase your skills as a machine learning engineer, it gives you something to show potential employers and say “here’s what I’ve done”.
Before sharing a model to the Hugging Face Hub, be sure to go through the following steps:
- Setup a Hugging Face token using the huggingface-cli login command.
- Read through the user access tokens guide.
- Set up an access token via https://huggingface.co/settings/tokens (ensure it has “write” access).
If you are using Google Colab, you can add your token under the “Secrets” tab on the left.
On my local computer, my token is saved to /home/daniel/.cache/huggingface/token
(thanks to running huggingface-cli login
on the command line).
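If you're working in a notebook, one option is to log in programmatically with the huggingface_hub library (a minimal sketch, assuming you've already created a token with "write" access):

from huggingface_hub import notebook_login

# Opens an interactive prompt in the notebook to paste in your Hugging Face token
notebook_login()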
And for more on sharing models to the Hugging Face Hub, be sure to check out the model sharing documentation.
We can push our model, tokenizer and other associated files to the Hugging Face Hub using the transformers.Trainer.push_to_hub
method.
We can also optionally do the following:
- Add a model card (something that describes how the model was created and what it can be used for) using transformers.Trainer.create_model_card.
- Add a custom README.md file to the model repository to explain more details about the model using huggingface_hub.HfApi.upload_file. This method is similar to the model card creation method above but with more customization (see the sketch below).
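For example, a minimal sketch of uploading a custom README.md with huggingface_hub might look like this (the local file path and repo_id below are placeholders, swap in your own):

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="path/to/your/README.md", # placeholder: a README.md you've written locally
    path_in_repo="README.md",
    repo_id="your_username/learn_hf_food_not_food_text_classifier-distilbert-base-uncased", # placeholder repo name
    repo_type="model",
    commit_message="Add custom README.md"
)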
Let’s save our model to the Hub!
# Save our model to the Hugging Face Hub
# This will be public, since we set hub_private_repo=False in our TrainingArguments
model_upload_url = trainer.push_to_hub(
    commit_message="Uploading food not food text classifier model",
    # token="YOUR_HF_TOKEN_HERE" # This will default to the token you have saved in your Hugging Face config
)
print(f"[INFO] Model successfully uploaded to Hugging Face Hub at URL: {model_upload_url}")
CommitInfo(commit_url='https://huggingface.co/mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased/commit/8a8a8aff5bdee5bc518e31558447dc684d448b8f', commit_message='Uploading food not food text classifier model', commit_description='', oid='8a8a8aff5bdee5bc518e31558447dc684d448b8f', pr_url=None, pr_revision=None, pr_num=None)
Model pushed to the Hugging Face Hub!
You may see the following error:
403 Forbidden: You don’t have the rights to create a model under the namespace “mrdbourke”. Cannot access content at: https://huggingface.co/api/repos/create. If you are trying to create or update content, make sure you have a token with the
write
role.
Or even:
HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6699c52XXXXXX)
Invalid username or password.
In this case, be sure to go through the setup steps above to make sure you have a Hugging Face access token with “write” access.
And since it’s public (by default), you can see it at https://huggingface.co/mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased (it gets saved to the same name as our target local directory).
You can now share and interact with this model online.
As well as download it for use in your own applications.
But before we make an application/demo with our trained model, let’s keep evaluating it.
7 Making and evaluating predictions on the test data
Model trained, let’s now evaluate it on the test data.
Or step 7 in our workflow:
- ✅ Create and preprocess data.
- ✅ Define the model we’d like to use with transformers.AutoModelForSequenceClassification (or another similar model class).
- ✅ Define training arguments (these are hyperparameters for our model) with transformers.TrainingArguments.
- ✅ Pass TrainingArguments from 3 and target datasets to an instance of transformers.Trainer.
- ✅ Train the model by calling Trainer.train().
- ✅ Save the model (to our local machine or to the Hugging Face Hub).
- Evaluate the trained model by making and inspecting predictions on the test data.
- Turn the model into a shareable demo.
A reminder that the test data is data that our model has never seen before.
So it will be a good estimate of how our model will do in a production setting.
We can make predictions on the test dataset using transformers.Trainer.predict
.
And then we can get the prediction values with the predictions
attribute and associated metrics with the metrics
attribute.
# Perform predictions on the test set
predictions_all = trainer.predict(tokenized_dataset["test"])
prediction_values = predictions_all.predictions
prediction_metrics = predictions_all.metrics

print(f"[INFO] Prediction metrics on the test data:")
prediction_metrics
[INFO] Prediction metrics on the test data:
{'test_loss': 0.0005385442636907101,
'test_accuracy': 1.0,
'test_runtime': 0.0421,
'test_samples_per_second': 1186.857,
'test_steps_per_second': 47.474}
Woah!
Looks like our model did an outstanding job!
And it was very quick too.
This is one of the benefits of using a smaller pretrained model and customizing it to your own dataset.
You can achieve outstanding results in a very quick time as well as have a model capable of performing thousands of predictions per second.
We can also calculate the accuracy by hand by comparing the prediction labels to the test labels.
To do so, we’ll:
- Calculate the prediction probabilities (though this is optional as we could skip straight to 2 and get the same results) by passing the prediction_values to torch.softmax.
- Find the index of the prediction value with the highest value (the index will be equivalent to the predicted label) using torch.argmax (we could also use np.argmax here) to find the predicted labels.
- Get the true labels from the test dataset using dataset["test"]["label"].
- Compare the predicted labels from 2 to the true labels from 3 using sklearn.metrics.accuracy_score to find the accuracy.
import torch
from sklearn.metrics import accuracy_score
# 1. Get prediction probabilities (this is optional, could get the same results with step 2 onwards)
pred_probs = torch.softmax(torch.tensor(prediction_values), dim=1)

# 2. Get the predicted labels
pred_labels = torch.argmax(pred_probs, dim=1)

# 3. Get the true labels
true_labels = dataset["test"]["label"]

# 4. Compare predicted labels to true labels to get the test accuracy
test_accuracy = accuracy_score(y_true=true_labels,
                               y_pred=pred_labels)

print(f"[INFO] Test accuracy: {test_accuracy*100}%")
[INFO] Test accuracy: 100.0%
Woah!
Looks like our model performs really well on our test set.
It will be interesting to see how it goes on real world samples.
We’ll test this later on.
How about we make a pandas DataFrame out of our test samples, predicted labels and predicted probabilities to further inspect our results?
# Make a DataFrame of test predictions
test_predictions_df = pd.DataFrame({
    "text": dataset["test"]["text"],
    "true_label": true_labels,
    "pred_label": pred_labels,
    "pred_prob": torch.max(pred_probs, dim=1).values
})
test_predictions_df.head()
text | true_label | pred_label | pred_prob | |
---|---|---|---|---|
0 | A slice of pepperoni pizza with a layer of mel... | 1 | 1 | 0.999369 |
1 | Red brick fireplace with a mantel serving as a... | 0 | 0 | 0.999662 |
2 | A bowl of sliced bell peppers with a sprinkle ... | 1 | 1 | 0.999365 |
3 | Set of mugs hanging on a hook | 0 | 0 | 0.999682 |
4 | Standing floor lamp providing light next to an... | 0 | 0 | 0.999678 |
We can find the examples with the lowest prediction probability to see where the model is unsure.
# Show 10 examples with low prediction probability
"pred_prob", ascending=True).head(10) test_predictions_df.sort_values(
text | true_label | pred_label | pred_prob | |
---|---|---|---|---|
40 | A bowl of cherries with a sprig of mint for ga... | 1 | 1 | 0.999331 |
11 | A close-up shot of a cheesy pizza slice being ... | 1 | 1 | 0.999348 |
26 | A fruit platter with a variety of exotic fruit... | 1 | 1 | 0.999351 |
42 | Boxes of apples, pears, pineapple, manadrins a... | 1 | 1 | 0.999353 |
46 | A bowl of sliced kiwi with a sprinkle of sugar... | 1 | 1 | 0.999360 |
37 | Close-up of a sushi roll with avocado, cucumbe... | 1 | 1 | 0.999360 |
31 | Crunchy sushi roll with tempura flakes or pank... | 1 | 1 | 0.999360 |
9 | Cherry tomatoes and mozzarella balls in a bowl... | 1 | 1 | 0.999360 |
14 | Two handfuls of bananas in a fruit bowl with g... | 1 | 1 | 0.999360 |
44 | Seasonal sushi roll with ingredients like pers... | 1 | 1 | 0.999361 |
Hmmm, it looks like our model has quite a high prediction probability for almost all samples.
We can further evaluate our model by making predictions on new custom data.
8 Making and inspecting predictions on custom text data
We’ve seen how our model performs on the test dataset (quite well).
But how might we check its performance on our own custom data?
For example, text-based image captions from the wild.
Well, we’ve got two ways to load our model now too:
- Load model locally from our computer (e.g. via models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased).
- Load model from Hugging Face Hub (e.g. via mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased).
Either way of loading the model results in the same outcome: being able to make predictions on given data.
So how about we start by setting up our model paths for both local loading and loading from the Hugging Face Hub.
# Setup local model path
= "models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased"
local_model_path
# Setup Hugging Face model path (see: https://huggingface.co/mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased)
# Note: Be sure to change "mrdbourke" to your own Hugging Face username
= "mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased" huggingface_model_path
8.1 Discussing ways to make predictions (inference)
When we’ve loaded our trained model, because of the way we’ve set it up, there are two main ways to make predictions on custom data:
- Pipeline mode using transformers.pipeline and passing it our target model, this allows us to preprocess custom data and make predictions in one step.
- PyTorch mode using a combination of transformers.AutoTokenizer and transformers.AutoModelForSequenceClassification and passing each our target model, this requires us to preprocess our data before passing it to a model, however, it offers the most customization.
Each method supports:
- Predictions one at a time (batch size of 1), for example, one person using the app at a time.
- Batches of predictions at a time (predictions with a batch size of n where n can be any number, e.g. 8, 16, 32), for example, many people using a service simultaneously such as a voice chat and needing to filter comments (predicting on batches of size n is usually much faster than batches of 1).
Whichever method we choose, we’ll have to set the target device we’d like the operations to happen on.
In general, it’s best to make predictions on the most powerful accelerator you have available.
And in most cases that will be a NVIDIA GPU > Mac GPU > CPU.
So let’s write a small function to pick the target device for us in that order.
Making predictions is also referred to as inference.
Because the model is going to infer on some data what the output should be.
Inference is often faster than training on a per sample basis as no model weights are updated (less computation).
However, inference can use more compute than training over the long run because you could train a model once over a few hours (or days or longer) and then use it for inference for several months (or longer), millions of times (or more).
def set_device():
    """
    Set device to CUDA if available, else MPS (Mac), else CPU.
    This defaults to using the best available device (usually).
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    return device

DEVICE = set_device()
print(f"[INFO] Using device: {DEVICE}")
[INFO] Using device: cuda
Target device set!
Let’s start predicting.
8.2 Making predictions with pipeline
The transformers.pipeline
method creates a machine learning pipeline.
Data goes in one end and predictions come out the other end.
You can create pipelines for many different tasks, such as, text classification, image classification, object detection, text generation and more.
Let’s see how we can create a pipeline for our text classification model.
To do so we’ll:
- Instantiate an instance of transformers.pipeline.
- Pass in the task parameter of text-classification (we can do this because our model is already formatted for text classification thanks to using transformers.AutoModelForSequenceClassification).
- Setup the model parameter to be local_model_path (though we could also use huggingface_model_path).
- Set the target device using the device parameter.
- Set top_k=1 to get the top prediction back (e.g. either "food" or "not_food", could set this higher to get more labels back).
- Set BATCH_SIZE=32 so we can pass it to the batch_size parameter. This will allow our model to make predictions on up to 32 samples at a time. Predicting on batches of data is usually much faster than single samples at a time, however, this often saturates at a point (e.g. predicting on batches of size 64 may be the same speed as 32 due to memory constraints).
There are many more pipelines available in the Hugging Face documentation.
As an exercise, I’d spend 10-15 minutes reading through the pipeline documentation to get familiar with what’s available.
Let’s setup our pipeline!
import torch
from transformers import pipeline
# Set the batch size for predictions
BATCH_SIZE = 32

# Create an instance of transformers.pipeline
food_not_food_classifier = pipeline(task="text-classification", # we can use this because our model is an instance of AutoModelForSequenceClassification
                                    model=local_model_path, # could also pass in huggingface_model_path
                                    device=DEVICE, # set the target device
                                    top_k=1, # only return the top predicted value
                                    batch_size=BATCH_SIZE) # perform predictions on up to BATCH_SIZE number of samples at a time

food_not_food_classifier
<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7f2695245950>
We’ve created an instance of transformers.pipelines.text_classification.TextClassificationPipeline
!
Now let’s test it out by passing it a string of text about food.
# Test our trained model on some example text
= "A delicious photo of a plate of scrambled eggs, bacon and toast"
sample_text_food food_not_food_classifier(sample_text_food)
[[{'label': 'food', 'score': 0.9993335604667664}]]
Nice! Our model gets it right.
How about a string not about food?
# Test the model on some more example text
= "A yellow tractor driving over the hill"
sample_text_not_food food_not_food_classifier(sample_text_not_food)
[[{'label': 'not_food', 'score': 0.9996254444122314}]]
Woohoo!
Correct again!
What if we passed in random text?
As in, someone types in something random to the model expecting an output.
# Pass in random text to the model
"cvnhertiejhwgdjshdfgh394587") food_not_food_classifier(
[[{'label': 'not_food', 'score': 0.9985743761062622}]]
The nature of machine learning models is that they are a predictive/generative function.
If you input data, they will output something.
When deploying machine learning models, there are many things to take into consideration.
One of the main ones being: “what data is going to go into the model?”
If this was a public facing model and people could enter any kind of text, they could enter random text rather than a sentence about food or not food.
Since the main goal of our model is to be able to classify image captions into food/not_food, we’d also have to consider image captions that are poorly written or contain little text.
This is why it’s important to continually test your models with as much example test/real-world data as you can.
Our pipeline
can also work with the model we saved to the Hugging Face Hub.
Let’s try out the same pipeline with model=huggingface_model_path
.
# Pipeline also works with remote models (will have to load the model locally first)
food_not_food_classifier_remote = pipeline(task="text-classification",
                                           model=huggingface_model_path, # load the model from Hugging Face Hub (will download the model if it doesn't already exist)
                                           batch_size=BATCH_SIZE,
                                           device=DEVICE)

food_not_food_classifier_remote("This is some new text about bananas and pancakes and ice cream")
[{'label': 'food', 'score': 0.9993208646774292}]
Beautiful!
Our model loaded from Hugging Face gets it right too!
8.3 Making multiple predictions at the same time with batch prediction
We can make predictions with our model one at a time but it’s often much faster to do them in batches.
To make predictions in batches, we can set up our transformers.pipeline
instance with the batch_size
parameter greater than 1
.
Then we’ll be able to pass multiple samples at once in the form of a Python list.
# Create batch size (we don't need to do this again but we're doing it for clarity)
BATCH_SIZE = 32 # this number is experimental and will require testing on your hardware to find the optimal value (e.g. lower if there are memory issues or higher to try speed up inference)

# Setup pipeline to handle batches (we don't need to do this again either but we're doing it for clarity)
food_not_food_classifier = pipeline(task="text-classification",
                                    model=local_model_path,
                                    batch_size=BATCH_SIZE,
                                    device=DEVICE)
Wonderful, now we’ve set up a pipeline
instance capable of handling batches, we can pass it a list of samples and it will make predictions on each.
How about we try with a collection of sentences which are a bit tricky?
# Create a list of sentences to make predictions on
sentences = [
    "I whipped up a fresh batch of code, but it seems to have a syntax error.",
    "We need to marinate these ideas overnight before presenting them to the client.",
    "The new software is definitely a spicy upgrade, taking some time to get used to.",
    "Her social media post was the perfect recipe for a viral sensation.",
    "He served up a rebuttal full of facts, leaving his opponent speechless.",
    "The team needs to simmer down a bit before tackling the next challenge.",
    "The presentation was a delicious blend of humor and information, keeping the audience engaged.",
    "A beautiful array of fake wax foods (shokuhin sampuru) in the front of a Japanese restaurant.",
    "Daniel Bourke is really cool :D",
    "My favoruite food is biltong!"
]

food_not_food_classifier(sentences)
[{'label': 'not_food', 'score': 0.9986234903335571},
{'label': 'not_food', 'score': 0.9993952512741089},
{'label': 'not_food', 'score': 0.9992876648902893},
{'label': 'not_food', 'score': 0.9994683861732483},
{'label': 'not_food', 'score': 0.9993450045585632},
{'label': 'not_food', 'score': 0.9994571805000305},
{'label': 'not_food', 'score': 0.9991866946220398},
{'label': 'food', 'score': 0.9993101358413696},
{'label': 'not_food', 'score': 0.9995250701904297},
{'label': 'food', 'score': 0.9966572523117065}]
Woah! That was quick!
And it looks like our model performed fairly well.
Though there was one harder sample which may be deemed as food
/not_food
, the sentence containing “shokuhin sampuru” (meaning “food model” in Japanese).
Is a sentence about food models (fake foods) still about food?
8.4 Time our model across larger sample sizes
We can say that our model is fast or that making predictions in batches is faster than one at a time.
But how about we run some tests to confirm this?
Let’s start by making predictions one at a time across 1000 sentences (100x our sentences
list) and then we’ll write some code to make predictions in batches.
We’ll time each and see how they go.
import time
# Create 1000 sentences
sentences_1000 = sentences * 100

# Time how long it takes to make predictions on all sentences (one at a time)
print(f"[INFO] Number of sentences: {len(sentences_1000)}")
start_time_one_at_a_time = time.time()
for sentence in sentences_1000:
    # Make a prediction on each sentence one at a time
    food_not_food_classifier(sentence)
end_time_one_at_a_time = time.time()

print(f"[INFO] Time taken for one at a time prediction: {end_time_one_at_a_time - start_time_one_at_a_time} seconds")
print(f"[INFO] Avg inference time per sentence: {(end_time_one_at_a_time - start_time_one_at_a_time) / len(sentences_1000)} seconds")
[INFO] Number of sentences: 1000
[INFO] Time taken for one at a time prediction: 2.5376925468444824 seconds
[INFO] Avg inference time per sentence: 0.0025376925468444823 seconds
Ok, on my local NVIDIA RTX 4090 GPU, it took around 2.5 seconds to make 1000 predictions one at a time.
That’s pretty good!
But let’s see if we can make it faster with batching.
To do so, we can increase the size of our sentences_big
list and pass the list directly to the model to enable batched prediction.
for i in [10, 100, 1000, 10_000]:
    sentences_big = sentences * i
    print(f"[INFO] Number of sentences: {len(sentences_big)}")

    start_time = time.time()
    # Predict on all sentences in batches
    food_not_food_classifier(sentences_big)
    end_time = time.time()

    print(f"[INFO] Inference time for {len(sentences_big)} sentences: {round(end_time - start_time, 5)} seconds.")
    print(f"[INFO] Avg inference time per sentence: {round((end_time - start_time) / len(sentences_big), 8)} seconds.")
    print()
[INFO] Number of sentences: 100
[INFO] Inference time for 100 sentences: 0.04512 seconds.
[INFO] Avg inference time per sentence: 0.0004512 seconds.
[INFO] Number of sentences: 1000
[INFO] Inference time for 1000 sentences: 0.31447 seconds.
[INFO] Avg inference time per sentence: 0.00031447 seconds.
[INFO] Number of sentences: 10000
[INFO] Inference time for 10000 sentences: 1.82615 seconds.
[INFO] Avg inference time per sentence: 0.00018261 seconds.
[INFO] Number of sentences: 100000
[INFO] Inference time for 100000 sentences: 18.63373 seconds.
[INFO] Avg inference time per sentence: 0.00018634 seconds.
Woah!
It looks like inference/prediction time is ~10-20x faster when using batched prediction versus predicting one at a time (on my local NVIDIA RTX 4090).
I ran some more tests with the same model on a different GPU on Google Colab (NVIDIA L4 GPU) and got similar results.
Number of Sentences | Total Prediction Time (seconds) | Prediction Type |
---|---|---|
100 | 0.62 | one at a time |
1000 | 6.19 | one at a time |
10000 | 61.08 | one at a time |
100000 | 605.46 | one at a time |
100 | 0.06 | batch |
1000 | 0.51 | batch |
10000 | 4.97 | batch |
100000 | 49.7 | batch |
Testing the speed of a custom text classifier model on different numbers of sentences with one at a time or batched prediction. Tests conducted on Google Colab with a NVIDIA L4 GPU. See the notebook for code to reproduce.
8.5 Making predictions with PyTorch
We’ve seen how to make predictions/perform inference with transformers.pipeline
, now let’s see how to do the same with PyTorch.
Performing predictions with PyTorch requires an extra step compared to pipeline
: we have to prepare our inputs first (turn the text into numbers).
Good news is, we can prepare our inputs with the tokenizer that got automatically saved with our model.
And since we’ve already trained a model and uploaded it to the Hugging Face Hub, we can load our model and tokenizer with transformers.AutoTokenizer
and transformers.AutoModelForSequenceClassification
passing it the saved path we used (mine is mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased
).
Let’s start by loading the tokenizer and see what it looks like to tokenize a piece of sample text.
from transformers import AutoTokenizer
# Setup model path (can be local or on Hugging Face)
# Note: Be sure to change "mrdbourke" to your own username
model_path = "mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased"

# Create an example to predict on
sample_text_food = "A delicious photo of a plate of scrambled eggs, bacon and toast"

# Prepare the tokenizer and tokenize the inputs
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_path)
inputs = tokenizer(sample_text_food,
                   return_tensors="pt") # return the output as PyTorch tensors
inputs
{'input_ids': tensor([[ 101, 1037, 12090, 6302, 1997, 1037, 5127, 1997, 13501, 6763,
1010, 11611, 1998, 15174, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Nice!
Text tokenized!
We get a dictionary of input_ids
(our text in token form) and attention_mask
(tells the model which tokens to pay attention to, 1
= pay attention, 0
= no attention).
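If you're curious what those IDs map back to, you can (optionally) turn them back into tokens with the tokenizer, a handy sanity check:

# Optional: inspect what the token IDs map back to
# (DistilBERT uses WordPiece tokens, so longer/rarer words may get split into sub-words)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
# Roughly: ['[CLS]', 'a', 'delicious', 'photo', 'of', 'a', 'plate', 'of', 'scrambled', 'eggs', ',', 'bacon', 'and', 'toast', '[SEP]']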
Now we can load the model with the same path.
from transformers import AutoModelForSequenceClassification
# Load our text classification model
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_path)
Model loaded!
Let’s make a prediction.
We can do so using the context manager torch.no_grad()
(because no gradients/weights get updated during inference) and passing our model our inputs
dictionary.
A little tidbit about using dictionaries as function inputs in Python is the ability to unpack the keys of the dictionary into function arguments.
This is possible using **TARGET_DICTIONARY
syntax. Where the **
means “use all the keys in the dictionary as function parameters”.
For example, the following two lines are equivalent:
# Using ** notation
outputs = model(**inputs)
# Using explicit notation
outputs = model(input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"])
Let’s make a prediction with PyTorch!
import torch
with torch.no_grad():
    outputs = model(**inputs) # '**' means input all of the dictionary keys as arguments to the function
    # outputs = model(input_ids=inputs["input_ids"],
    #                 attention_mask=inputs["attention_mask"]) # same as above, but explicitly passing in the keys

outputs
SequenceClassifierOutput(loss=None, logits=tensor([[-3.3686, 3.9443]]), hidden_states=None, attentions=None)
Beautiful, we’ve got some outputs, which contain logits
with two values (one for each class).
The index of the higher value is our model’s predicted class.
We can find it by taking the outputs.logits
and calling argmax().item()
on it.
We can also find the prediction probability by passing outputs.logits
to torch.softmax
.
# Get predicted class and prediction probability
predicted_class_id = outputs.logits.argmax().item()
prediction_probability = torch.softmax(outputs.logits, dim=1).max().item()

print(f"Text: {sample_text_food}")
print(f"Predicted label: {model.config.id2label[predicted_class_id]}")
print(f"Prediction probability: {prediction_probability}")
Text: A delicious photo of a plate of scrambled eggs, bacon and toast
Predicted label: food
Prediction probability: 0.9993335604667664
Beautiful! A prediction made with pure PyTorch! It looks very much correct too.
How about we put it all together?
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
= "mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased"
model_path
# Load the model and tokenizer
= AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_path)
tokenizer = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_path)
model
# Make sample text and tokenize it
= "A photo of a broccoli, salmon, rice and radish dish"
sample_text = tokenizer(sample_text, return_tensors="pt")
inputs
# Make a prediction
with torch.no_grad():
= model(**inputs)
outputs
# Get predicted class and prediction probability
= outputs.logits
output_logits = torch.argmax(output_logits, dim=1).item()
predicted_class_id = model.config.id2label[predicted_class_id]
predicted_class_label = torch.softmax(output_logits, dim=1).max().item()
predicted_probability
# Print outputs
print(f"Text: {sample_text}")
print(f"Predicted class: {predicted_class_label} (prob: {predicted_probability * 100:.2f}%)")
Text: A photo of a broccoli, salmon, rice and radish dish
Predicted class: food (prob: 99.94%)
9 Putting it all together
Ok, ok, we’ve covered a lot of ground going from dataset to trained model to making predictions on custom samples.
How about we put all of the steps we’ve covered so far together in a single code cell (or two)?
To do so, we’ll:
- Import necessary packages (e.g. datasets, transformers.pipeline, torch and more).
- Setup variables for model training and saving pipeline such as our model name, save directory and dataset name.
- Create a directory for saving models.
- Load and preprocess the dataset from Hugging Face Hub using datasets.load_dataset.
- Import a tokenizer with transformers.AutoTokenizer and map it to our dataset with dataset.map.
- Set up an evaluation metric with evaluate & create a function to evaluate our model’s predictions.
- Import a model with transformers.AutoModelForSequenceClassification and prepare it for training with transformers.TrainingArguments and transformers.Trainer.
- Train the model on our text dataset by calling transformers.Trainer.train.
- Save the trained model to a local directory.
- Push the model to the Hugging Face Hub.
- Evaluate the model on the test data.
- Test the trained model on a custom sample using transformers.pipeline to make sure it works.
Phew!
A fair few steps but nothing we can’t handle!
Let’s do it.
# 1. Import necessary packages
import pprint
from pathlib import Path
import numpy as np
import torch
import datasets
import evaluate
from transformers import pipeline
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# 2. Setup variables for model training and saving pipeline
DATASET_NAME = "mrdbourke/learn_hf_food_not_food_image_captions"
MODEL_NAME = "distilbert/distilbert-base-uncased"
MODEL_SAVE_DIR_NAME = "models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased"

# 3. Create a directory for saving models
# Note: This will override our existing saved model (if there is one)
print(f"[INFO] Creating directory for saving models: {MODEL_SAVE_DIR_NAME}")
model_save_dir = Path(MODEL_SAVE_DIR_NAME)
model_save_dir.mkdir(parents=True, exist_ok=True)

# 4. Load and preprocess the dataset from Hugging Face Hub
print(f"[INFO] Downloading dataset from Hugging Face Hub, name: {DATASET_NAME}")
dataset = datasets.load_dataset(path=DATASET_NAME)

# Create mappings from id2label and label2id (adjust these for your target dataset, can also create these programmatically)
id2label = {0: "not_food", 1: "food"}
label2id = {"not_food": 0, "food": 1}

# Create function to map labels to numbers in dataset
def map_labels_to_number(example):
    example["label"] = label2id[example["label"]]
    return example

# Map preprocessing function to dataset
dataset = dataset["train"].map(map_labels_to_number)

# Split the dataset into train/test sets
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# 5. Import a tokenizer and map it to our dataset
print(f"[INFO] Tokenizing text for model training with tokenizer: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=MODEL_NAME,
                                          use_fast=True)

# Create a preprocessing function to tokenize text
def tokenize_text(examples):
    return tokenizer(examples["text"],
                     padding=True,
                     truncation=True)

tokenized_dataset = dataset.map(function=tokenize_text,
                                batched=True,
                                batch_size=1000)

# 6. Set up an evaluation metric & function to evaluate our model
accuracy_metric = evaluate.load("accuracy")

def compute_accuracy(predictions_and_labels):
    predictions, labels = predictions_and_labels

    if len(predictions.shape) >= 2:
        predictions = np.argmax(predictions, axis=1)

    return accuracy_metric.compute(predictions=predictions, references=labels) # note: use "references" parameter rather than "labels"

# 7. Import a model and prepare it for training
print(f"[INFO] Loading model: {MODEL_NAME}")
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)
print(f"[INFO] Model loading complete!")

# Setup TrainingArguments
training_args = TrainingArguments(
    output_dir=model_save_dir,
    learning_rate=0.0001,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    use_cpu=False,
    seed=42,
    load_best_model_at_end=True,
    logging_strategy="epoch",
    report_to="none",
    push_to_hub=False,
    hub_private_repo=False # Note: if set to False, your model will be publicly available
)

# Create Trainer instance and train model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_accuracy
)

# 8. Train the model on our text dataset
print(f"[INFO] Commencing model training...")
results = trainer.train()

# 9. Save the trained model (note: this will overwrite our previous model, this is ok)
print(f"[INFO] Model training complete, saving model to local path: {model_save_dir}")
trainer.save_model(output_dir=model_save_dir)

# 10. Push the model to the Hugging Face Hub
print(f"[INFO] Uploading model to Hugging Face Hub...")
model_upload_url = trainer.push_to_hub(
    commit_message="Uploading food not food text classifier model",
    # token="YOUR_HF_TOKEN_HERE" # requires a "write" HF token
)
print(f"[INFO] Model upload complete, model available at: {model_upload_url}")

# 11. Evaluate the model on the test data
print(f"[INFO] Performing evaluation on test dataset...")
predictions_all = trainer.predict(tokenized_dataset["test"])
prediction_values = predictions_all.predictions
prediction_metrics = predictions_all.metrics

print(f"[INFO] Prediction metrics on the test data:")
pprint.pprint(prediction_metrics)
[INFO] Creating directory for saving models: models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased
[INFO] Downloading dataset from Hugging Face Hub, name: mrdbourke/learn_hf_food_not_food_image_captions
[INFO] Tokenizing text for model training with tokenizer: distilbert/distilbert-base-uncased
[INFO] Loading model: distilbert/distilbert-base-uncased
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO] Model loading complete!
[INFO] Commencing model training...
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 0.372500 | 0.067892 | 1.000000 |
2 | 0.028300 | 0.009194 | 1.000000 |
3 | 0.004700 | 0.004919 | 1.000000 |
4 | 0.002000 | 0.002121 | 1.000000 |
5 | 0.001200 | 0.001302 | 1.000000 |
6 | 0.000900 | 0.000982 | 1.000000 |
7 | 0.000800 | 0.000839 | 1.000000 |
8 | 0.000700 | 0.000766 | 1.000000 |
9 | 0.000700 | 0.000728 | 1.000000 |
10 | 0.000700 | 0.000715 | 1.000000 |
[INFO] Model training complete, saving model to local path: models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased
[INFO] Uploading model to Hugging Face Hub...
[INFO] Model upload complete, model available at: https://huggingface.co/mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased/tree/main/
[INFO] Performing evaluation on test dataset...
[INFO] Prediction metrics on the test data:
{'test_accuracy': 1.0,
'test_loss': 0.0007152689504437149,
'test_runtime': 0.0507,
'test_samples_per_second': 986.278,
'test_steps_per_second': 39.451}
Woohoo! It all worked!
Now let’s make sure it works by turning it into a transformers.pipeline
and passing it a custom sample.
# 12. Make sure the model works by testing it on a custom sample
= pipeline(task="text-classification",
food_not_food_classifier =model_save_dir, # can also use model on Hugging Face Hub path
model=torch.device("cuda") if torch.cuda.is_available() else "cpu",
device=1,
top_k=32)
batch_size
"Yo! We just built a food not food sentence classifier model! Good news is, it can be replicated for other kinds of text classification!") food_not_food_classifier(
[[{'label': 'food', 'score': 0.9969706535339355}]]
Nice!
Looks like putting all of our code in one cell worked.
How about we make our model even more accessible by turning it into a demo?
10 Turning our model into a demo
Once you’ve trained and saved a model, one of the best ways to continue to test it and show/share it with others is to create a demo.
Or step number 8 in our workflow:
- ✅ Create and preprocess data.
- ✅ Define the model we’d like to use with transformers.AutoModelForSequenceClassification (or another similar model class).
- ✅ Define training arguments (these are hyperparameters for our model) with transformers.TrainingArguments.
- ✅ Pass TrainingArguments from 3 and target datasets to an instance of transformers.Trainer.
- ✅ Train the model by calling Trainer.train().
- ✅ Save the model (to our local machine or to the Hugging Face Hub).
- ✅ Evaluate the trained model by making and inspecting predictions on the test data.
- Turn the model into a shareable demo.
A demo is a small application with the focus of showing the workflow of your model from data in to data out.
It’s also one way to start testing your model in the wild.
You may know where it works and where it doesn’t but chances are someone out there will find a new bug before you do.
To build our demo, we’re going to use an open-source library called Gradio.
Gradio allows you to make machine learning demo apps with Python code and best of all, it’s part of the Hugging Face ecosystem so you can share your demo to the public directly through Hugging Face.
Gradio works on the premise of input -> function (this could be a model) -> output.
In our case:
- Input = A string of text.
- Function = Our trained text classification model.
- Output = Predicted output of food/not_food with prediction probability.
10.1 Creating a simple function to perform inference
Let’s create a function to take an input of text, process it with our model and return a dictionary of the predicted labels.
Our function will:
- Take an input of a string of text.
- Setup a text classification pipeline using transformers.pipeline as well as our trained model (this can be from our local machine or loaded from Hugging Face). We’ll return all the probabilities from the output using top_k=None.
- Get the outputs of the text classification pipeline from 2 as a list of dictionaries (e.g. [{'label': 'food', 'score': 0.999105}, {'label': 'not_food', 'score': 0.00089}]).
- Format and return the list of dictionaries from 3 to be compatible with Gradio’s gr.Label output (we’ll see this later) which requires a dictionary in the form {"label_1": probability_1, "label_2": probability_2}.
Onward!
from typing import Dict
# 1. Create a function which takes text as input
def food_not_food_classifier(text: str) -> Dict[str, float]:
    """
    Takes an input string of text and classifies it into food/not_food in the form of a dictionary.
    """
    # 2. Setup the pipeline to use the local model (or Hugging Face model path)
    food_not_food_classifier = pipeline(task="text-classification",
                                        model=local_model_path,
                                        batch_size=32,
                                        device="cuda" if torch.cuda.is_available() else "cpu", # set the device to work in any environment
                                        top_k=None) # return all possible scores (not just top-1)

    # 3. Get outputs from pipeline (as a list of dicts)
    outputs = food_not_food_classifier(text)[0]

    # 4. Format output for Gradio (e.g. {"label_1": probability_1, "label_2": probability_2})
    output_dict = {}
    for item in outputs:
        output_dict[item["label"]] = item["score"]

    return output_dict

# Test out the function
food_not_food_classifier("My lunch today was chicken and salad")
{'food': 0.9992194175720215, 'not_food': 0.0007805348141118884}
Beautiful!
Looks like our function is working.
10.2 Building a small Gradio demo to run locally
We’ve got a working function to go from text to predicted labels and probabilities.
Let’s now build a Gradio interface to showcase our model.
We can do so by:
- Importing Gradio (using import gradio as gr).
- Creating an instance of gr.Interface called demo with the parameters inputs="text" (for our text-based inputs) and outputs=gr.Label(num_top_classes=2) to display our output dictionary. We can also add some descriptive aspects to our demo with the title, description and examples parameters.
- Running/launching the demo with gr.Interface.launch().
# 1. Import Gradio as the common alias "gr"
import gradio as gr
# 2. Setup a Gradio interface to accept text and output labels
demo = gr.Interface(
    fn=food_not_food_classifier,
    inputs="text",
    outputs=gr.Label(num_top_classes=2), # show top 2 classes (that's all we have)
    title="Food or Not Food Classifier",
    description="A text classifier to determine if a sentence is about food or not food.",
    examples=[["I whipped up a fresh batch of code, but it seems to have a syntax error."],
              ["A delicious photo of a plate of scrambled eggs, bacon and toast."]])
# 3. Launch the interface
demo.launch()
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Woohoo!
We’ve made a very clean way of interacting with our model.
However, our model is still only largely accessible to us (except for the model file we’ve uploaded to Hugging Face).
How about we make our demo publicly available so it’s even easier for people to interact with our model?
The gradio.Interface
class is full of many different options, I’d highly recommend reading through the documentation for 10-15 minutes to get an idea of it.
If your workflow requires inputs -> function (e.g. a model making predictions on the input) -> output, chances are, you can build it with Gradio.
11 Making our demo publicly accessible
One of the best ways to share your machine learning work is by creating an application.
And one of the best places to share your applications is Hugging Face Spaces.
Hugging Face Spaces allows you to host machine learning (and non-machine learning) applications for free (with optional paid hardware upgrades).
If you’re familiar with GitHub, Hugging Face Spaces works similar to a GitHub repository (each Space is a Git repository itself).
If not, that’s okay, think of Hugging Face Spaces as an online folder where you can upload your files and have them accessed by others.
Creating a Hugging Face Space can be done in two main ways:
- Manually - By going to the Hugging Face Spaces website and clicking “Create new space”. Or by going directly to https://www.huggingface.co/new-space. Here, you’ll be able to setup a few settings for your Space and choose the framework/runtime (e.g. Streamlit, Gradio, Docker and more).
- Programmatically - By using the Hugging Face Hub Python API we can write code to directly upload files to the Hugging Face Hub, including Hugging Face Spaces.
Both are great options but we’re going to take the second approach.
This is so we can create our Hugging Face Space right from this notebook.
To do so, we’ll create three files:
- `app.py` - The main Python file that will run on our Hugging Face Space. Inside we'll include all the code necessary to run our Gradio demo (as above). Hugging Face Spaces will automatically recognize the `app.py` file and run it for us.
- `requirements.txt` - A text file listing all of the Python packages needed to run our `app.py` file. Before our Space starts to run, all of the packages in this file will be installed.
- `README.md` - A markdown file with details about our Space as well as specific Space-related metadata (we'll see this later on).
We’ll create these files with the following file structure:
demos/
└── food_not_food_text_classifier/
├── app.py
├── README.md
└── requirements.txt
Why this way?
Doing it in the above style means we'll have a directory which contains all of our demos (`demos/`) as well as a dedicated directory which contains our `food`/`not_food` demo application (`food_not_food_text_classifier/`).
This way, we'll be able to upload the whole `demos/food_not_food_text_classifier/` folder to Hugging Face Spaces.
Let’s start by making a directory to store our demo application files.
from pathlib import Path
# Make a directory for demos
= Path("../demos")
demos_dir =True)
demos_dir.mkdir(exist_ok
# Create a folder for the food_not_food_text_classifer demo
= Path(demos_dir, "food_not_food_text_classifier")
food_not_food_text_classifier_demo_dir =True) food_not_food_text_classifier_demo_dir.mkdir(exist_ok
Demo directory created, let's now create our required files.
11.1 Making an app file
Our `app.py` file will be the main part of our Hugging Face Space.
The good news is, we've already created most of it when we created our original demo.
Inside the `app.py` file we'll:
- Import the required libraries/packages for running our demo app.
- Setup a function for going from text to our trained model's predicted outputs. Because our model is already hosted on the Hugging Face Hub, we can pass `pipeline` our model's name (e.g. `mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased`) and when we upload our `app.py` file to Hugging Face Spaces, it will load the model directly from the Hub.
    - Note: Be sure to change "`mrdbourke`" to your own Hugging Face username.
- Create a demo just as before with `gr.Interface`.
- Launch our demo with `gr.Interface.launch()`.
We can write all of the above in a notebook cell.
And we can turn it into a file by using the %%writefile
magic command and passing it our target filepath.
Let’s do it!
%%writefile ../demos/food_not_food_text_classifier/app.py
# 1. Import the required packages
import torch
import gradio as gr
from typing import Dict
from transformers import pipeline
# 2. Define function to use our model on given text
def food_not_food_classifier(text: str) -> Dict[str, float]:
    # Set up text classification pipeline
    food_not_food_classifier = pipeline(task="text-classification",
                                        # Because our model is on Hugging Face already, we can pass in the model name directly
                                        model="mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased", # link to model on HF Hub
                                        device="cuda" if torch.cuda.is_available() else "cpu",
                                        top_k=None) # return all possible scores (not just top-1)

    # Get outputs from pipeline (as a list of dicts)
    outputs = food_not_food_classifier(text)[0]

    # Format output for Gradio (e.g. {"label_1": probability_1, "label_2": probability_2})
    output_dict = {}
    for item in outputs:
        output_dict[item["label"]] = item["score"]

    return output_dict

# 3. Create a Gradio interface with details about our app
description = """
A text classifier to determine if a sentence is about food or not food.

Fine-tuned from [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on a [small dataset of food and not food text](https://huggingface.co/datasets/mrdbourke/learn_hf_food_not_food_image_captions).

See [source code](https://github.com/mrdbourke/learn-huggingface/blob/main/notebooks/hugging_face_text_classification_tutorial.ipynb).
"""

demo = gr.Interface(fn=food_not_food_classifier,
                    inputs="text",
                    outputs=gr.Label(num_top_classes=2), # show top 2 classes (that's all we have)
                    title="🍗🚫🥑 Food or Not Food Text Classifier",
                    description=description,
                    examples=[["I whipped up a fresh batch of code, but it seems to have a syntax error."],
                              ["A delicious photo of a plate of scrambled eggs, bacon and toast."]])

# 4. Launch the interface
if __name__ == "__main__":
    demo.launch()
Overwriting ../demos/food_not_food_text_classifier/app.py
app.py
file created!
Now let’s setup the requirements file.
11.2 Making a requirements file
When you upload an app.py
file to Hugging Face Spaces, it will attempt to run it automatically.
And just like running the file locally, we need to make sure all of the required packages are available.
Otherwise our Space will produce an error like the following:
===== Application Startup at 2024-06-13 05:37:21 =====
Traceback (most recent call last):
File "/home/user/app/app.py", line 1, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
Good news is, our demo only has three requirements: `gradio`, `torch` and `transformers`.
Let's create a `requirements.txt` file with the packages we need and save it to the same directory as our `app.py` file.
%%writefile ../demos/food_not_food_text_classifier/requirements.txt
gradio
torch
transformers
Overwriting ../demos/food_not_food_text_classifier/requirements.txt
Beautiful!
Hugging Face Spaces will automatically recognize the requirements.txt
file and install the listed packages into our Space.
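As a small optional extra, if you want your Space builds to be more reproducible, you can pin package versions in `requirements.txt`. The version numbers below are placeholders only, match them to whatever you have working locally:

gradio==4.36.1
torch==2.3.0
transformers==4.41.2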
11.3 Making a README file
Our `app.py` file can contain information about our demo, however, we can also use a `README.md` file to further communicate our work.
It is common practice in Git repositories (including GitHub and the Hugging Face Hub) to add a `README.md` file to your project so people can read more (hence "read me") about what your project is about.
We can include anything in markdown-style text in the `README.md` file.
However, Spaces also use a special YAML block at the top of the `README.md` file in the root directory for configuration details.
Inside the YAML block you can put special metadata details about your Space including:
- `title` - The title of your Space (e.g. `title: Food Not Food Text Classifier`).
- `emoji` - The emoji to display on your Space (e.g. `emoji: 🍗🚫🥑`).
- `app_file` - The target app file for Spaces to run (set to `app_file: app.py` by default).
And there are plenty more in the Spaces Configuration References documentation.
Let's create a `README.md` file with a YAML block at the top detailing some of the metadata about our project.
Getting the YAML block at the top of the `README.md` right can take some practice.
If you want to see how one gets created, try making a Hugging Face Space with the "Create new Space" button on the https://huggingface.co/spaces page and see what the `README.md` file starts with (that's how I found out what to do!).
%%writefile ../demos/food_not_food_text_classifier/README.md
---
title: Food Not Food Text Classifier
emoji: 🍗🚫🥑
colorFrom: blue
colorTo: yellow
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
---

# 🍗🚫🥑 Food Not Food Text Classifier

Small demo to showcase a text classifier to determine if a sentence is about food or not food.

DistilBERT model fine-tuned on a small synthetic dataset of 250 generated [Food or Not Food image captions](https://huggingface.co/datasets/mrdbourke/learn_hf_food_not_food_image_captions).

[Source code notebook](https://github.com/mrdbourke/learn-huggingface/blob/main/notebooks/hugging_face_text_classification_tutorial.ipynb).
Overwriting ../demos/food_not_food_text_classifier/README.md
README.md
created!
Now let’s check out the files we have in our demos/food_not_food_text_classifier/
folder.
!ls ../demos/food_not_food_text_classifier
README.md app.py requirements.txt
Perfect!
Looks like we’ve got all the files we need to create our Space.
Let’s upload them to the Hugging Face Hub.
11.4 Uploading our demo to Hugging Face Spaces
We’ve created all of the files required for our demo, now for the fun part!
Let’s upload them to Hugging Face Spaces.
To do so programmatically, we can use the Hugging Face Hub Python API.
The Hugging Face Hub Python API has many different options for interacting with the Hugging Face Hub programmatically.
You can create repositories, upload files, upload folders, add comments, change permissions and much much more.
Be sure to explore the documentation for at least 10-15 minutes to get an idea of what’s possible.
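For example, pushing a single file to an existing repo looks roughly like the sketch below (the paths and repo name are hypothetical; for our demo we'll use `upload_folder` instead so we can send the whole folder at once):

from huggingface_hub import upload_file

# Sketch: upload one local file to an existing repo on the Hugging Face Hub
# (the file path and repo name below are hypothetical examples)
upload_file(
    path_or_fileobj="path/to/local/file.txt", # local file to upload
    path_in_repo="file.txt",                  # where the file should live in the repo
    repo_id="your_username/your_repo_name",
    repo_type="space",                        # could also be a "model" or "dataset" repo
    commit_message="Add file.txt")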
To get our demo hosted on Hugging Face Spaces we’ll go through the following steps:
- Import the required methods from the `huggingface_hub` package, including `create_repo`, `get_full_repo_name`, `upload_file` (optional, we'll be using `upload_folder`) and `upload_folder`.
- Define the demo folder we'd like to upload as well as the different parameters for the Hugging Face Space, such as the repo type (`"space"`), our target Space name, the target Space SDK (`"gradio"`) and our Hugging Face token with write access (optional if it isn't already set up).
- Create a repository on Hugging Face Spaces using the `huggingface_hub.create_repo` method and filling out the appropriate parameters.
- Get the full name of our created repository using the `huggingface_hub.get_full_repo_name` method (we could hard-code this but I like to get it programmatically in case things change).
- Upload the contents of our target demo folder (`../demos/food_not_food_text_classifier/`) to the Hugging Face Hub with `huggingface_hub.upload_folder`.
- Hope it all works and inspect the results! 🤞
A fair few steps but we’ve got this!
# 1. Import the required methods for uploading to the Hugging Face Hub
from huggingface_hub import (
    create_repo,
    get_full_repo_name,
    upload_file,  # for uploading a single file (if necessary)
    upload_folder # for uploading multiple files (in a folder)
)

# 2. Define the parameters we'd like to use for the upload
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "../demos/food_not_food_text_classifier"
HF_TARGET_SPACE_NAME = "learn_hf_food_not_food_text_classifier_demo"
HF_REPO_TYPE = "space" # we're creating a Hugging Face Space
HF_SPACE_SDK = "gradio"
HF_TOKEN = "" # optional: set to your Hugging Face token (but I'd advise storing this as an environment variable as previously discussed)

# 3. Create a Space repository on Hugging Face Hub
print(f"[INFO] Creating repo on Hugging Face Hub with name: {HF_TARGET_SPACE_NAME}")
create_repo(
    repo_id=HF_TARGET_SPACE_NAME,
    # token=HF_TOKEN, # optional: set token manually (though it will be automatically recognized if it's available as an environment variable)
    repo_type=HF_REPO_TYPE,
    private=False, # set to True if you don't want your Space to be accessible to others
    space_sdk=HF_SPACE_SDK,
    exist_ok=True, # set to False if you want an error to raise if the repo_id already exists
)

# 4. Get the full repository name (e.g. {username}/{model_id} or {username}/{space_name})
full_hf_repo_name = get_full_repo_name(model_id=HF_TARGET_SPACE_NAME)
print(f"[INFO] Full Hugging Face Hub repo name: {full_hf_repo_name}")

# 5. Upload our demo folder
print(f"[INFO] Uploading {LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD} to repo: {full_hf_repo_name}")
folder_upload_url = upload_folder(
    repo_id=full_hf_repo_name,
    folder_path=LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
    path_in_repo=".", # upload our folder to the root directory ("." means "base" or "root", this is the default)
    # token=HF_TOKEN, # optional: set token manually
    repo_type=HF_REPO_TYPE,
    commit_message="Uploading food not food text classifier demo app.py"
)
print(f"[INFO] Demo folder successfully uploaded with commit URL: {folder_upload_url}")
[INFO] Creating repo on Hugging Face Hub with name: learn_hf_food_not_food_text_classifier_demo
[INFO] Full Hugging Face Hub repo name: mrdbourke/learn_hf_food_not_food_text_classifier_demo
[INFO] Uploading ../demos/food_not_food_text_classifier to repo: mrdbourke/learn_hf_food_not_food_text_classifier_demo
[INFO] Demo folder successfully uploaded with commit URL: https://huggingface.co/spaces/mrdbourke/learn_hf_food_not_food_text_classifier_demo/tree/main/.
Excellent!
Looks like all of the files in our target demo folder were uploaded!
Once this happens, Hugging Face Spaces will take a couple of minutes to build our application.
If there are any errors, it will let us know.
Otherwise, our demo application should be running live and be ready to test at a URL similar to: https://huggingface.co/spaces/mrdbourke/learn_hf_food_not_food_text_classifier_demo (though you may have to swap my username “mrdbourke
” for your own as well as the name you chose for the Space).
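If you'd rather check on the build from code than keep refreshing the Space page, the `huggingface_hub` package can report the Space's current stage. A minimal sketch, assuming a recent version of `huggingface_hub` (which exposes `get_space_runtime`) and using my Space name, swap in your own:

from huggingface_hub import HfApi

# Ask the Hub what stage our Space is in (e.g. "BUILDING", "RUNNING", "RUNTIME_ERROR")
api = HfApi()
space_runtime = api.get_space_runtime(repo_id="mrdbourke/learn_hf_food_not_food_text_classifier_demo")
print(space_runtime.stage)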
11.5 Testing our hosted demo
One of the really cool things about Hugging Face Spaces is that we can share our demo application as a link so others can try it out.
We can also embed it right into our notebook.
To do so, we can go to the three dots in the top right of our hosted Space and select “Embed this Space”.
We then have the option to embed our Space using a JavaScript web component, HTML iframe
or via the direct URL.
Since Jupyter notebooks have the ability to render HTML via IPython.display.HTML
, let’s embed our Space with HTML.
from IPython.display import HTML
# You can get embeddable HTML code for your demo by clicking the "Embed" button on the demo page
HTML(data='''
<iframe
    src="https://mrdbourke-learn-hf-food-not-food-text-classifier-demo.hf.space"
    frameborder="0"
    width="850"
    height="450"
></iframe>
''')
Now that’s cool!
We can try out our Food Not Food Text Classifier app from right within our notebook!
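We can also call the hosted Space programmatically rather than through the interface. A minimal sketch using the `gradio_client` package (you may need to `pip install gradio_client` first, and I'm assuming the default `/predict` endpoint name Gradio gives a single-function interface, swap my Space name for your own):

from gradio_client import Client

# Connect to the hosted Space and send it a sentence to classify
client = Client("mrdbourke/learn_hf_food_not_food_text_classifier_demo")
result = client.predict("My lunch today was chicken and salad", api_name="/predict")
print(result)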
12 Summary
You should be very proud of yourself!
We’ve just gone end-to-end on a machine learning workflow with Hugging Face.
From loading a dataset to training a model to deploying that model in the form of a public demo.
Here are some of the main takeaways from this project.
The Hugging Face ecosystem is a collection of powerful and open-source tools for machine learning workflows.
- Hugging Face `datasets` helps you to store and preprocess datasets of almost any shape and size.
- Hugging Face `transformers` has many built-in pretrained models for many different use cases, and components such as `transformers.Trainer` help you to tailor those models to your own custom use cases.
- Hugging Face `tokenizers` works closely with `transformers` and allows the efficient conversion of raw text data into a numerical representation (which is required for machine learning models) - there's a tiny sketch of this right after this list.
- The Hugging Face Hub is a great place to share your models and machine learning projects. Over time, you can build up a portfolio of machine learning-based projects to show future employers or clients and to help the community grow.
- There are many more, but I'll leave these for you to explore as extra-curriculum.
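As a refresher on that tokenizer point, here's a tiny sketch using the same DistilBERT checkpoint we fine-tuned from:

from transformers import AutoTokenizer

# Turn raw text into the token IDs a model actually sees
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
print(tokenizer("My lunch today was chicken and salad")["input_ids"])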
A common machine learning workflow: dataset -> model -> demo.
Before a machine learning model is incorporated into a larger application, a very common workflow is to:
- Find an existing or create a new dataset for your specific problem.
- Train/fine-tune and evaluate an existing model on your dataset.
- Create a small demo application to test your trained model.
We’ve just gone through all of these steps for text classification!
Text classification is a very common problem in many business settings. If you have a similar problem but a different dataset, you can replicate this workflow.
Building your own model has several advantages over using APIs.
APIs are very helpful to try something out.
However, depending on your use case, you may often want to create your own custom model.
Training your own model can often result in faster predictions and far lower running costs over time.
The Hugging Face ecosystem enables the creation of custom models for almost any kind of machine learning problem.
13 Exercises and Extensions
There’s no better way to improve other than practicing what you’ve learned.
The following exercises and extensions are designed for you to practice the things we’ve covered in this project.
- Our text classification model works on `food`/`not_food` text samples. How would you create your own binary text classification model on different classes?
    - Create ~10 or so samples of your own text classes (e.g. 10 samples each of `spam`/`not_spam` emails) and retrain a text classification model.
    - Bonus: Share the model you've made in a demo just like we did here. Send it to me, I'd love to see it! My email is on my website.
- We've trained our model on two classes (binary classification) but how might we increase that to 3 or more classes (multi-class classification)?
    - Hint: see the `num_labels` parameter in `transformers.AutoModelForSequenceClassification` (there's a small sketch of this after this list).
- Our model seems to work pretty well on our test data and on the handful of examples we tried manually. Can you find any examples where our model fails? For example, what kind of sentences does it struggle with? How could you fix this?
    - Hint: Our model has been trained on examples with roughly 5-12 words, does it still work on very short inputs (e.g. "pie")?
    - Bonus: If you find any cases where our model doesn't perform well, make an extra 10-20 examples of these, add them to the dataset and then retrain the model (you'll have to look up "how to add rows to an existing Hugging Face dataset"). How does the model perform after adding these additional samples?
- Datasets are fundamental to any machine learning project, getting to know how to process and interact with them is a fundamental skill. Spend 1 hour going through the Hugging Face Datasets tutorial.
- Write 5 things you can do with Hugging Face Datasets and where they might come in handy.
- The Hugging Face `transformers` library has many features. The following readings are to help understand a handful of them.
    - Spend 10 minutes exploring the `transformers.TrainingArguments` documentation.
    - Spend 10 minutes reading the `transformers.Trainer` documentation.
- Spend 10 minutes reading the Hugging Face model sharing documentation.
- Spend 10 minutes reading the Hugging Face `transformers.pipeline` documentation.
    - What does a `pipeline` do?
    - Name 3 different kinds of pipelines and describe what they do in a sentence.
- Gradio is a powerful library for making machine learning demos, learning more about it will help you in future creations. Spend 10-15 minutes reading the Gradio quickstart documentation.
- What are 3 kinds of demos you can create?
- What are 3 different inputs and outputs you can make?
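For the multi-class exercise above, here's a rough sketch of the `num_labels` idea (the three classes and label mappings below are made up purely for illustration):

from transformers import AutoModelForSequenceClassification

# Hypothetical 3-class setup (e.g. food/drink/not_food)
id2label = {0: "food", 1: "drink", 2: "not_food"}
label2id = {label: idx for idx, label in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="distilbert/distilbert-base-uncased",
    num_labels=3, # one output per class
    id2label=id2label,
    label2id=label2id)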
14 Extra resources
There are many things we touched on but didn't go into much depth on in this notebook.
The following resources are for those who’d like to learn a little bit more.
- See how the food not food image caption dataset was created with synthetic text data (image captions generated by a Large Language Model) in the example Google Colab notebook.
- Hugging Face have a great guide on sequence classification (it’s what this notebook was built on).
- For more on the concept of padding and truncation in sequence processing, I’d recommend the Hugging Face padding and truncation guide.
- For more on Transformers (the architecture) as well as the DistilBert model:
- Read Transformers from scratch by Peter Bloem.
- Watch Andrej Karpathy’s lecture on Transformers and their history.
- Read the original Attention is all you need paper (the paper that introduced the Transformer architecture).
- Read the DistilBert paper from the Hugging Face team (paper that introduced the DistilBert architecture and training setup).