from IPython.display import HTML
HTML("""
<iframe
    src="https://mrdbourke-trashify-demo-v3.hf.space"
    frameborder="0"
    width="850"
    height="850"
></iframe>
""")
[Work in Progress] Object Detection with Hugging Face Transformers Tutorial
Details:
- Goal: build several custom object detection models + deploy them as demos
- Status: Work in progress (code works but the annotations are in need of work)
- See example finished project demo: https://huggingface.co/spaces/mrdbourke/trashify_demo_v3
In progress:
- Going through and adding headings/annotations to each different section
- Start with overview + introduction + getting started, then go from there (e.g. intro the project and show where we’re going to end up) ✅
Later:
- Record video version of the materials
Source code on GitHub | Online book version | Setup guide | Video Course (coming soon)
1 TK - Overview
TK - Make an intro about being on the Trashify 🚮 team with a mission to make the world a cleaner place, trashify = using ML to incentivize people to pick up trash in their local area
Welcome to the Learn Hugging Face Object Detection project!
Inside this project, we’ll learn bits and pieces about the Hugging Face ecosystem as well as how to build our own custom object detection model.
We’ll start with a collection of images with bounding box files as our dataset, fine-tune an existing computer vision model to detect items in an image and then share our model as a demo others can use.
TK image - update cover image for object detection
Feel free to keep reading through the notebook but if you'd like to run the code yourself, be sure to go through the setup guide first.
1.1 TK - What we’re going to build
We're going to be building Trashify 🚮, an object detection model which incentivises people to pick up trash in their local area by detecting `bin`, `trash` and `hand` items in an image.
If all three items are detected, a person gets +1 point!
For example, say you were going for a walk around your neighbourhood and took a photo of yourself picking up a piece of trash (with your hand or a trash arm) and putting it in the bin, you would get a point.
With this object detection model, you could deploy it to an application which would automatically detect the target classes and then save the result to an online leaderboard.
The incentive would be to score the most points, in turn picking up the most pieces of trash, in a given area.
More specifically, we’re going to follow the following steps:
- Data: Problem definition and dataset preparation - Getting a dataset/setting up the problem space.
- Model: Finding, training and evaluating a model - Finding an object detection model suitable for our problem on Hugging Face and customizing it to our own dataset.
- Demo: Creating a demo and putting our model into the real world - Sharing our trained model in a way others can access and use.
By the end of this project, you’ll have a trained model and demo on Hugging Face you can share with others:
1.2 TK - What is object detection?
Object detection is the process of identifying and locating an item in an image.
Where item can mean almost anything.
For example:
- Detecting car licence plates in a video feed (videos are a series of images) for a parking lot entrance.
- Detecting delivery people walking towards your front door on a security camera.
- Detecting defects on a manufacturing line.
- Detecting potholes in the road so repair work can automatically be scheduled.
- Detecting small pests (Varroa Mite) on the bodies of bees.
- Detecting weeds in a field so you know what to remove and what to keep.
–
TK - add examples of actual trash identification projects, see:
- Google using machine learning for trash identification — https://sustainability.google/operating-sustainably/stories/circular-economy-marketplace/
- Trashify website for identifying trash — https://www.trashify.tech/
- Waste management with deep learning — https://www.sciencedirect.com/science/article/abs/pii/S0956053X23001915
- Label Studio being used for labelling a trash dataset — https://labelstud.io/blog/ameru-labeling-for-a-greener-world/
–
Note: Object detection is also sometimes referred to as image localization or object localization. For consistency, I will use the term object detection, however, either term could be substituted.
* TK image - examples of where object detection is used
Where image classification deals with classifying an image as a whole into a single class, object detection endeavours to find the specific target item and where it is in an image.
One of the most common ways of showing where an item is in an image is by displaying a bounding box (a rectangle-like box around the target item).
An object detection model will often take an input image tensor in the shape `[3, 640, 640]` (`[colour_channels, height, width]`) and output a tensor in the form `[class_name, x_min, y_min, x_max, y_max]` or `[class_name, x1, y1, x2, y2]` (these are two ways to write the same example format, there are more formats, we'll see these below in Table 1).
Where:

- `class_name` = The classification of the target item (e.g. `"car"`, `"person"`, `"banana"`, `"piece_of_trash"`, this could be almost anything).
- `x_min` = The `x` value of the top left corner of the box.
- `y_min` = The `y` value of the top left corner of the box.
- `x_max` = The `x` value of the bottom right corner of the box.
- `y_max` = The `y` value of the bottom right corner of the box.
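To make this output format concrete, here's a minimal sketch using made-up values (the class name and coordinates are purely illustrative):

# A made-up example detection in absolute XYXY format (values are illustrative only)
prediction = {"class_name": "trash",
              "x_min": 120.0, "y_min": 340.0,  # top left corner
              "x_max": 310.0, "y_max": 560.0}  # bottom right corner

# The box width and height can be recovered from the corner coordinates
box_width = prediction["x_max"] - prediction["x_min"]   # 190.0 pixels
box_height = prediction["y_max"] - prediction["y_min"]  # 220.0 pixels
print(f"Detected '{prediction['class_name']}' in a {box_width}x{box_height} pixel box")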
– TK image – example of a bounding box on an image
When you get into the world of object detection, you will find that there are several different bounding box formats.
There are three major formats you should be familiar with: `XYXY`, `XYWH` and `CXCYWH` (there are more but these are the most common).
Knowing which bounding box format you’re working with can be the difference between a good model and a very poor model (wrong bounding boxes = wrong outcome).
We’ll get hands-on with a couple of these in this project.
But for an in-depth example of all three, I created a guide on different bounding box formats and how to draw them, reading this should give a good intuition behind each style of bounding box.
1.3 TK - Why train your own object detection models?
You can customize pre-trained models for object detection, or use API-powered models and LLMs such as Gemini, LandingAI and DINO-X.
Depending on your requirements, there are several pros and cons for using your own model versus using an API.
Training/fine-tuning your own model:
Pros | Cons |
---|---|
Control: Full control over model lifecycle. | Can be complex to set up. |
No usage limits (aside from compute constraints). | Requires dedicated compute resources for training/inference. |
Can train once and deploy everywhere/whenever you want (for example, Tesla deploying a model to all self-driving cars). | Requires maintenance over time to ensure performance remains up to par. |
Privacy: Data can be kept in-house/app and doesn’t need to go to a third party. | Can require longer development cycles compared to using existing APIs. |
Speed: Customizing a small model for a specific use case often means it runs much faster on local hardware, for example, modern object detection models can achieve 70-100+ FPS (frames per second) on modern GPU hardware. |
Using a pre-built model API:
Pros | Cons |
---|---|
Ease of use: often can be set up within a few lines of code. | If the model API goes down, your service goes down. |
No maintenance of compute resources. | Data is required to be sent to a third-party for processing. |
Access to the most advanced models. | The API may have usage limits per day/time period. |
Can scale if usage increases. | Can be much slower than using dedicated models due to requiring an API call. |
For this project, we’re going to focus on fine-tuning our own model.
1.4 TK - Workflow we’re going to follow
The good news for us is that the Hugging Face ecosystem makes working on custom machine learning projects an absolute blast.
And the workflow is reproducible across several kinds of projects.
Start with data (or skip this step and go straight to a model) -> get/customize a model -> build and share a demo.
With this in mind, our motto is data, model, demo!
More specifically, we’re going to follow the rough workflow of:
1. Create, preprocess and load data using Hugging Face Datasets.
2. Define the model we'd like to use with `transformers.AutoModelForObjectDetection` (or another similar model class).
3. Define training arguments (these are hyperparameters for our model) with `transformers.TrainingArguments`.
4. Pass the `TrainingArguments` from 3 and our target datasets to an instance of `transformers.Trainer`.
5. Train the model by calling `Trainer.train()`.
6. Save the model (to our local machine or to the Hugging Face Hub).
7. Evaluate the trained model by making and inspecting predictions on the test data.
8. Turn the model into a shareable demo.
I say rough because machine learning projects are often non-linear in nature.
As in, because machine learning projects involve many experiments, they can kind of be all over the place.
But this workflow will give us some good guidelines to follow.
2 TK - Importing necessary libraries
Let’s get started!
First, we’ll import the required libraries.
If you’re running on your local computer, be sure to check out the getting setup guide to make sure you have everything you need.
If you're using Google Colab, many of the following libraries will be installed by default.
However, we’ll have to install a few extras to get everything working.
If you're running on Google Colab, this notebook will work best with access to a GPU. To enable a GPU, go to Runtime ➡️ Change runtime type ➡️ Hardware accelerator ➡️ GPU.
We’ll need to install the following libraries from the Hugging Face ecosystem:
- `transformers` - comes pre-installed on Google Colab but if you're running on your local machine, you can install it via `pip install transformers`.
- `datasets` - a library for accessing and manipulating datasets on and off the Hugging Face Hub, you can install it via `pip install datasets`.
- `evaluate` - a library for evaluating machine learning model performance with various metrics, you can install it via `pip install evaluate`.
- `accelerate` - a library for training machine learning models faster, you can install it via `pip install accelerate`.
- `gradio` - a library for creating interactive demos of machine learning models, you can install it via `pip install gradio`.
And the following library is not part of the Hugging Face ecosystem but it is helpful for evaluating our models:
- `torchmetrics` - a library containing many evaluation metrics compatible with PyTorch/Transformers, you can install it via `pip install torchmetrics`.

We can also check the versions of our software with `package_name.__version__`.
# Install/import dependencies (this is mostly for Google Colab, as the other dependencies are available by default in Colab)
try:
    import datasets, evaluate, accelerate
    import gradio as gr
except ModuleNotFoundError:
    !pip install -U datasets evaluate accelerate gradio # -U stands for "upgrade" so we'll get the latest version by default
    import datasets, evaluate, accelerate
    import gradio as gr

import random
import numpy as np
import torch
import transformers

# Required for evaluation
# Can install with !pip install torchmetrics[detection]
import torchmetrics
import pycocotools
# Check versions (as long as you've got the following versions or higher, you should be good)
print(f"Using transformers version: {transformers.__version__}")
print(f"Using datasets version: {datasets.__version__}")
print(f"Using torch version: {torch.__version__}")
print(f"Using torchmetrics version: {torchmetrics.__version__}")
Using transformers version: 4.48.3
Using datasets version: 3.1.0
Using torch version: 2.6.0+cu124
Using torchmetrics version: 1.4.1
Wonderful, as long as your versions are the same as or higher than the versions above, you should be able to run the code below.
3 Getting a dataset
Okay, now we've got the required libraries, let's get a dataset.
Getting a dataset is one of the most important parts of a machine learning project.
The dataset you use often determines the type of model you use as well as the quality of the outputs of that model.
Meaning, if you have a high quality dataset, chances are, your future model could also have high quality outputs.
It also means if your dataset is of poor quality, your model will likely also have poor quality outputs.
For an object detection problem, your dataset will likely come in the form of a group of images as well as a file with annotations belonging to those images.
For example, you might have the following setup:
folder_of_images/
image_1.jpeg
image_2.jpeg
image_3.jpeg
annotations.json
Where the `annotations.json` contains details about the contents of each image:
annotations.json
[
{
'image_path': 'image_1.jpeg',
'image_id': 42,
'annotations':
{
'file_name': ['image_1.jpeg'],
'image_id': [42],
'category_id': [1],
'bbox': [
[360.20001220703125, 528.5, 177.1999969482422, 261.79998779296875],
],
'area': [46390.9609375]
},
'label_source': 'manual_prodigy_label',
'image_source': 'manual_taken_photo'
},
...(more labels down here)
]
Don't worry too much about the exact meaning of everything in the above `annotations.json` file for now (this is only one example, there are many different ways object detection information can be stored).
The main point is that each target image is paired with an associated label.
Now, like all good machine learning cooking shows, here's a dataset I prepared earlier.
TK image - dataset on Hugging Face
It's stored on Hugging Face Datasets (also called the Hugging Face Hub) under the name `mrdbourke/trashify_manual_labelled_images`.
This is a dataset I've collected manually (yes, by picking up 1000+ pieces of trash and photographing them) as well as labelled by hand (by drawing boxes on each image with a labelling tool called Prodigy).
3.1 Loading the dataset
To load a dataset stored on the Hugging Face Hub we can use the `datasets.load_dataset(path=NAME_OR_PATH_OF_DATASET)` function and pass it the name/path of the dataset we want to load.
In our case, our dataset name is `mrdbourke/trashify_manual_labelled_images` (you can also change this for your own dataset).
And since our dataset is hosted on Hugging Face, when we run the following code for the first time, it will download it.
If your target dataset is quite large, this download may take a while.
However, once the dataset is downloaded, subsequent reloads will be much faster.
One way to find out what a function or method does is to lookup the documentation.
Another way is to write the function/method name with a question mark afterwards.
For example:
from datasets import load_dataset
load_dataset?
Give it a try.
You should see some helpful information about what inputs the method takes and how they are used.
Let’s load our dataset and check it out.
from datasets import load_dataset

# Load our Trashify dataset
dataset = load_dataset(path="mrdbourke/trashify_manual_labelled_images")
dataset
DatasetDict({
train: Dataset({
features: ['image', 'image_id', 'annotations', 'label_source', 'image_source'],
num_rows: 1128
})
})
Beautiful!
We can see that there is a `train` split of the dataset already which currently contains all of the samples (`1128` in total).
There are also some `features` that come with our dataset which are related to our object detection goal.
print(f"[INFO] Length of original dataset: {len(dataset['train'])}")
print(f"[INFO] Dataset features:")
from pprint import pprint
'train'].features) pprint(dataset[
[INFO] Length of original dataset: 1128
[INFO] Dataset features:
{'annotations': Sequence(feature={'area': Value(dtype='float32', id=None),
'bbox': Sequence(feature=Value(dtype='float32',
id=None),
length=4,
id=None),
'category_id': ClassLabel(names=['bin',
'hand',
'not_bin',
'not_hand',
'not_trash',
'trash',
'trash_arm'],
id=None),
'file_name': Value(dtype='string', id=None),
'image_id': Value(dtype='int64', id=None),
'iscrowd': Value(dtype='int64', id=None)},
length=-1,
id=None),
'image': Image(mode=None, decode=True, id=None),
'image_id': Value(dtype='int64', id=None),
'image_source': Value(dtype='string', id=None),
'label_source': Value(dtype='string', id=None)}
Nice!
We can see our dataset `features` contain the following fields:

- `annotations` - A sequence of values including a `bbox` field (short for bounding box) as well as a `category_id` field which contains the target objects we'd like to identify in our images (`['bin', 'hand', 'not_bin', 'not_hand', 'not_trash', 'trash', 'trash_arm']`).
- `image` - This contains the target image associated with a given set of `annotations` (in our case, images and annotations have been uploaded to the Hugging Face Hub together).
- `image_id` - A unique ID assigned to a given sample.
- `image_source` - Where the image came from (all of our images have been manually collected).
- `label_source` - Where the image label came from (all of our images have been manually labelled).
3.2 Viewing a single sample from our data
Now we’ve seen the features, let’s check out a single sample from our dataset.
We can index on a single sample of the `"train"` set just like indexing on a Python list.

# View a single sample of the dataset
dataset["train"][42]
{'image': <PIL.Image.Image image mode=RGB size=960x1280>,
'image_id': 745,
'annotations': {'file_name': ['094f4f41-dc07-4704-96d7-8d5e82c9edb9.jpeg',
'094f4f41-dc07-4704-96d7-8d5e82c9edb9.jpeg',
'094f4f41-dc07-4704-96d7-8d5e82c9edb9.jpeg'],
'image_id': [745, 745, 745],
'category_id': [5, 1, 0],
'bbox': [[333.1000061035156,
611.2000122070312,
244.89999389648438,
321.29998779296875],
[504.0, 612.9000244140625, 451.29998779296875, 650.7999877929688],
[202.8000030517578,
366.20001220703125,
532.9000244140625,
555.4000244140625]],
'iscrowd': [0, 0, 0],
'area': [78686.3671875, 293706.03125, 295972.65625]},
'label_source': 'manual_prodigy_label',
'image_source': 'manual_taken_photo'}
We see a few more details here compared to just looking at the features.
We notice the `image` is a `PIL.Image` with size `960x1280` (width x height).
And the `file_name` is a UUID (Universally Unique Identifier, made with `uuid.uuid4()`).
The `bbox` field in the `annotations` key contains a list of bounding boxes associated with the image.
In this case, there are 3 different bounding boxes, with `category_id` values of `5`, `1` and `0` (we'll map these to class names shortly).
Let’s inspect a single bounding box.
"train"][42]["annotations"]["bbox"][0] dataset[
[333.1000061035156, 611.2000122070312, 244.89999389648438, 321.29998779296875]
This array gives us the coordinates of a single bounding box in the format `XYWH`.
Where:

- `X` is the x-coordinate of the top left corner of the box (`333.1`).
- `Y` is the y-coordinate of the top left corner of the box (`611.2`).
- `W` is the width of the box (`244.9`).
- `H` is the height of the box (`321.3`).

All of these values are in absolute pixel values (meaning an x-coordinate of `333.1` is `333.1` pixels across on the x-axis).
How do I know this?
I know this because I created the box labels and this is the default format Prodigy (the labelling tool I used) outputs boxes in.
However, if you were to come across another bounding box dataset, one of the first steps would be to figure out what format your bounding boxes are in.
We’ll see more on bounding box formats shortly.
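As a quick sketch (using the box values we just inspected, rounded for readability), converting this `XYWH` box to `XYXY` only requires adding the width and height to the top left corner. We'll do this with `torchvision` later on, but it's worth seeing the arithmetic in plain Python first:

# The sample box from above in absolute XYWH format (values rounded for readability)
x, y, w, h = 333.1, 611.2, 244.9, 321.3

# XYWH -> XYXY: the bottom right corner is the top left corner plus the width/height
x_min, y_min = x, y
x_max, y_max = x + w, y + h

print(f"XYWH: {[x, y, w, h]}")
print(f"XYXY: {[round(val, 1) for val in [x_min, y_min, x_max, y_max]]}") # ~[333.1, 611.2, 578.0, 932.5]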
3.3 Extracting the category names from our data
Before we start to visualize our sample image and bounding boxes, let’s extract the category names from our dataset.
We can do so by accessing the `features` attribute of our `dataset` and then following it through to find the `category_id` feature, which contains a list of our text-based class names.
When working with different categories, it’s good practice to get a list or mapping (e.g. a Python dictionary) from category name to ID and vice versa.
For example:
# Category to ID
{"class_name": 0}

# ID to Category
{0: "class_name"}
Not all datasets will have this implemented in an easy to access way, so it might take a bit of research to get it created.
Let's access the class names in our dataset and save them to a variable `categories`.

# Get the categories from the dataset
# Note: This requires the dataset to have been uploaded with this information setup, not all datasets will have this available.
categories = dataset["train"].features["annotations"].feature["category_id"]

# Get the names attribute
categories.names
['bin', 'hand', 'not_bin', 'not_hand', 'not_trash', 'trash', 'trash_arm']
Beautiful!
We get the following class names:

- `bin` - A rubbish bin or trash can.
- `hand` - A person's hand.
- `not_bin` - Negative version of `bin` for items that look like a `bin` but shouldn't be identified as one.
- `not_hand` - Negative version of `hand` for items that look like a `hand` but shouldn't be identified as one.
- `not_trash` - Negative version of `trash` for items that look like `trash` but shouldn't be identified as it.
- `trash` - An item of trash you might find on a walk such as an old plastic bottle, food wrapper, cigarette butt or used coffee cup.
- `trash_arm` - A mechanical arm used for picking up trash.
The goal of our computer vision model will be: given an image, detect items belonging to these target classes if they are present.
3.4 Creating a mapping from numbers to labels
Now we’ve got our text-based class names, let’s create a mapping from label to ID and ID to label.
For each of these, Hugging Face use the terminology `label2id` and `id2label` respectively.

# Map IDs to class names and vice versa
id2label = {i: class_name for i, class_name in enumerate(categories.names)}
label2id = {value: key for key, value in id2label.items()}

print(f"Label to ID mapping:\n{label2id}\n")
print(f"ID to label mapping:\n{id2label}")
# id2label, label2id
Label to ID mapping:
{'bin': 0, 'hand': 1, 'not_bin': 2, 'not_hand': 3, 'not_trash': 4, 'trash': 5, 'trash_arm': 6}
ID to label mapping:
{0: 'bin', 1: 'hand', 2: 'not_bin', 3: 'not_hand', 4: 'not_trash', 5: 'trash', 6: 'trash_arm'}
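These mappings come in handy whenever we need to translate between the numeric `category_id` values in our annotations and human-readable class names, for example:

# Look up a class name from an ID and vice versa
print(id2label[5])        # 'trash'
print(label2id["trash"])  # 5

# Decode a list of category IDs (e.g. from a sample's annotations) into class names
category_ids = [5, 1, 0]
print([id2label[i] for i in category_ids])  # ['trash', 'hand', 'bin']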
3.5 Creating a colour palette
Ok, we know which class name matches which ID, now let's create a dictionary of different colours we can use to display our bounding boxes.
It’s one thing to plot bounding boxes, it’s another thing to make them look nice.
And we always want our plots looking nice!
We'll colour the positive classes `bin`, `hand`, `trash` and `trash_arm` in nice bright colours.
And the negative classes `not_bin`, `not_hand` and `not_trash` in a light red colour to indicate they're the negative versions.
Our colour dictionary will map `class_name` -> `(red, green, blue)` (or RGB) colour values.
# Make colour dictionary
colour_palette = {
    'bin': (0, 0, 224),         # Bright Blue (High contrast with greenery) in format (red, green, blue)
    'not_bin': (255, 80, 80),   # Light Red to indicate negative class
    'hand': (148, 0, 211),      # Dark Purple (Contrasts well with skin tones)
    'not_hand': (255, 80, 80),  # Light Red to indicate negative class
    'trash': (0, 255, 0),       # Bright Green (For trash-related items)
    'not_trash': (255, 80, 80), # Light Red to indicate negative class
    'trash_arm': (255, 140, 0), # Deep Orange (Highly visible)
}
Let’s check out what these colours look like!
It’s the ABV motto: Always Be Visualizing!
We can plot our colours with `matplotlib`.
We'll just have to write a small function to normalize our colour values from `[0, 255]` to `[0, 1]` (`matplotlib` expects our colour values to be between 0 and 1).
import matplotlib.pyplot as plt
import numpy as np

# Normalize RGB values to 0-1 range
def normalize_rgb(rgb_tuple):
    return tuple(x/255 for x in rgb_tuple)

# Turn colors into normalized RGB values for matplotlib
colors_and_labels_rgb = [(key, normalize_rgb(value)) for key, value in colour_palette.items()]

# Create figure and axis
fig, ax = plt.subplots(1, 7, figsize=(8, 1))

# Flatten the axis array for easier iteration
ax = ax.flatten()

# Plot each color square
for idx, (label, color) in enumerate(colors_and_labels_rgb):
    ax[idx].add_patch(plt.Rectangle(xy=(0, 0),
                                    width=1,
                                    height=1,
                                    facecolor=color))
    ax[idx].set_title(label)
    ax[idx].set_xlim(0, 1)
    ax[idx].set_ylim(0, 1)
    ax[idx].axis('off')

plt.tight_layout()
plt.show()
Sensational!
Now we know what colours to look out for when we visualize our bounding boxes.
4 TK - Plotting a single image and visualizing the boxes
Okay, okay, finally time to plot an image!
Let's take a random sample from our `dataset` and plot the image as well as the boxes on it.
To save some space in our notebook (plotting many images can increase the size of our notebook dramatically), we'll create two small helper functions:

- `half_image` - Halves the size of a given image.
- `half_boxes` - Divides the input coordinates of a given input box by 2.
These functions aren’t 100% necessary in our workflow.
They’re just to make the images slightly smaller so they fit better in the notebook.
import PIL

def half_image(image: PIL.Image) -> PIL.Image:
    """
    Resizes a given input image by half and returns the smaller version.
    """
    return image.resize(size=(image.size[0] // 2, image.size[1] // 2))

def half_boxes(boxes):
    """
    Halves an array/tensor of input boxes and returns them. Necessary for plotting them on a half-sized image.

    For example:
        boxes = [100, 100, 100, 100]
        halved_boxes = half_boxes(boxes)
        print(halved_boxes)
        >>> [50, 50, 50, 50]
    """
    if isinstance(boxes, list):
        return [box // 2 for box in boxes]
    if isinstance(boxes, np.ndarray):
        return boxes // 2
    if isinstance(boxes, torch.Tensor):
        return boxes // 2

# Test the functions
image_test = dataset["train"][42]["image"]
image_test_half = half_image(image_test)
print(f"[INFO] Original image size: {image_test.size} | Half image size: {image_test_half.size}")

boxes_test_list = [100, 100, 100, 100]
print(f"[INFO] Original boxes: {boxes_test_list} | Half boxes: {half_boxes(boxes_test_list)}")

boxes_test_torch = torch.tensor([100.0, 100.0, 100.0, 100.0])
print(f"[INFO] Original boxes: {boxes_test_torch} | Half boxes: {half_boxes(boxes_test_torch)}")
[INFO] Original image size: (960, 1280) | Half image size: (480, 640)
[INFO] Original boxes: [100, 100, 100, 100] | Half boxes: [50, 50, 50, 50]
[INFO] Original boxes: tensor([100., 100., 100., 100.]) | Half boxes: tensor([50., 50., 50., 50.])
To plot an image and its associated boxes, we'll do the following steps:

1. Select a random sample from the `dataset`.
2. Extract the `"image"` (our image is in `PIL` format) and `"bbox"` keys from the random sample.
    - We can also optionally halve the size of our image/boxes to save space. In our case, we will halve our image and boxes.
3. Turn the box coordinates into a `torch.tensor` (we'll be using `torchvision` utilities to plot the image and boxes).
4. Convert the box format from `XYWH` to `XYXY` using `torchvision.ops.box_convert` (we do this because `torchvision.utils.draw_bounding_boxes` requires `XYXY` format as input).
5. Get a list of label names (e.g. `"bin"`, `"trash"`, etc) associated with each of the boxes as well as a list of colours to match (these will be from our `colour_palette`).
6. Draw the boxes on the target image by:
    - Turning the image into a tensor with `torchvision.transforms.functional.pil_to_tensor`.
    - Drawing the bounding boxes on our image tensor with `torchvision.utils.draw_bounding_boxes`.
    - Turning the image and bounding box tensors back into a `PIL` image with `torchvision.transforms.functional.to_pil_image`.
Phew!
A fair few steps…
But we’ve got this!
If the terms `XYXY` or `XYWH` or all of the drawing methods sound a bit confusing or intimidating, don't worry, there's a fair bit going on here.
We'll cover bounding box formats, such as `XYXY`, shortly.
In the meantime, if you want to learn more about different bounding box formats and how to draw them, I wrote A Guide to Bounding Box Formats and How to Draw Them which you might find helpful.
# Plotting a bounding box on a single image
import random
import torch

from torchvision.ops import box_convert
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import pil_to_tensor, to_pil_image

# 1. Select a random sample from our dataset (subtract 1 so the index stays in range)
random_index = random.randint(0, len(dataset["train"]) - 1)
print(f"[INFO] Showing training sample from index: {random_index}")
random_sample = dataset["train"][random_index]

# 2. Get image and boxes from random sample
random_sample_image = random_sample["image"]
random_sample_boxes = random_sample["annotations"]["bbox"]

# Optional: Halve the image and boxes for space saving (all of the following code will work with/without half-sized images)
half_random_sample_image = half_image(random_sample_image)
half_random_sample_boxes = half_boxes(random_sample_boxes)

# 3. Turn box coordinates into a tensor
boxes_xywh = torch.tensor(half_random_sample_boxes)
print(f"Boxes in XYWH format: {boxes_xywh}")

# 4. Convert boxes from XYWH -> XYXY
# torchvision.utils.draw_bounding_boxes requires input boxes in XYXY format (x_min, y_min, x_max, y_max)
boxes_xyxy = box_convert(boxes=boxes_xywh,
                         in_fmt="xywh",
                         out_fmt="xyxy")
print(f"Boxes XYXY: {boxes_xyxy}")

# 5. Get label names of target boxes and colours to match
random_sample_label_names = [categories.int2str(x) for x in random_sample["annotations"]["category_id"]]
random_sample_colours = [colour_palette[label_name] for label_name in random_sample_label_names]
print(f"Label names: {random_sample_label_names}")
print(f"Colour names: {random_sample_colours}")

# 6. Draw the boxes on the image as a tensor and then turn it into a PIL image
to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=half_random_sample_image),
        boxes=boxes_xyxy,
        colors=random_sample_colours,
        labels=random_sample_label_names,
        width=2,
        label_colors=random_sample_colours
    )
)
[INFO] Showing training sample from index: 1126
Boxes in XYWH format: tensor([[173., 255., 182., 205.],
[295., 346., 132., 155.],
[102., 265., 189., 242.],
[354., 302., 120., 321.]], dtype=torch.float64)
Boxes XYXY: tensor([[173., 255., 355., 460.],
[295., 346., 427., 501.],
[102., 265., 291., 507.],
[354., 302., 474., 623.]], dtype=torch.float64)
Label names: ['trash', 'hand', 'bin', 'bin']
Colour names: [(0, 255, 0), (148, 0, 211), (0, 0, 224), (0, 0, 224)]
Outstanding!
Our first official bounding boxes plotted on an image!
Now the idea of Trashify 🚮 is coming to life.
Depending on the random sample you're looking at, you should see some combination of `['bin', 'hand', 'not_bin', 'not_hand', 'not_trash', 'trash', 'trash_arm']`.
Our goal will be to build an object detection model to replicate these boxes on a given image.
Whenever working with a new dataset, I find it good practice to view 100+ random samples of the data.
In our case, this would mean viewing 100 random images with their bounding boxes drawn on them.
Doing so starts to build your own intuition of the data.
Using this intuition, along with evaluation metrics, you can start to get a better idea of how your model might be performing later on.
Keep this in mind for any new dataset or problem space you’re working on.
Start by looking at 100+ random samples.
And yes, generally more is better.
So you can practice by running the code cell above a number of times to see the different kinds of images and boxes in the dataset.
Can you think of any scenarios which the dataset might be missing?
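If you'd like a quicker way to eyeball many samples, here's a small sketch (assuming the `dataset` and `categories` objects from earlier are loaded) which prints the labels present in a handful of random samples. Pair it with the plotting cell above to view the images themselves:

# Print the labels present in a few random training samples
import random

for _ in range(5):
    idx = random.randint(0, len(dataset["train"]) - 1)
    sample = dataset["train"][idx]
    label_names = [categories.int2str(i) for i in sample["annotations"]["category_id"]]
    print(f"Sample {idx}: {label_names}")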
5 Different bounding box formats
When drawing our bounding box, we discussed the terms `XYXY` and `XYWH`.
Well, we didn’t really discuss these at all…
But that’s why we’re here.
One of the most confusing things in the world of object detection is the different formats bounding boxes come in.
Are your boxes in `XYXY`, `XYWH` or `CXCYWH` format?
Are they in absolute format?
Or normalized format?
Perhaps a table will help us.
The following table contains a non-exhaustive list of some of the most common bounding box formats you’ll come across in the wild.
Box format | Description | Absolute Example | Normalized Example | Source |
---|---|---|---|---|
XYXY | Describes the top left corner coordinates (x1, y1) as well as the bottom right corner coordinates (x2, y2) of a box. Also referred to as: `[x1, y1, x2, y2]` or `[x_min, y_min, x_max, y_max]`. | `[8.9, 275.3, 867.5, 964.0]` | `[0.009, 0.215, 0.904, 0.753]` | The PASCAL VOC dataset uses the absolute version of this format, `torchvision.utils.draw_bounding_boxes` defaults to the absolute version of this format. |
XYWH | Describes the top left corner coordinates (x1, y1) as well as the width (`box_width`) and height (`box_height`) of the target box. The bottom right corner (x2, y2) is found by adding the width and height to the top left corner coordinates `(x1 + box_width, y1 + box_height)`. Also referred to as: `[x1, y1, box_width, box_height]` or `[x_min, y_min, box_width, box_height]`. | `[8.9, 275.3, 858.6, 688.7]` | `[0.009, 0.215, 0.894, 0.538]` | The COCO (Common Objects in Context) dataset uses the absolute version of this format, see the section under "bbox". |
CXCYWH | Describes the center coordinates of the bounding box (center_x, center_y) as well as the width (`box_width`) and height (`box_height`) of the target box. Also referred to as: `[center_x, center_y, box_width, box_height]`. | `[438.2, 619.65, 858.6, 688.7]` | `[0.456, 0.484, 0.894, 0.538]` | The normalized version was introduced in the YOLOv3 (You Only Look Once) paper and is used by many later forms of YOLO. |
5.1 Absolute or normalized format?
In absolute coordinate form, bounding box values are on the same scale as the image width and height dimensions (e.g. our images are `960x1280` pixels).
For example, in `XYXY` format: `["bin", 8.9, 275.3, 867.5, 964.0]`.
An `(x1, y1)` (or `(x_min, y_min)`) coordinate of `(8.9, 275.3)` means the top left corner is `8.9` pixels in on the x-axis, and `275.3` pixels down on the y-axis.
In normalized coordinate form, values are between `[0, 1]` and are proportions of the image width and height.
For example, in `XYXY` format: `["bin", 0.009, 0.215, 0.904, 0.753]`.
A normalized `(x1, y1)` (or `(x_min, y_min)`) coordinate of `(0.009, 0.215)` means the top left corner is `0.009 * image_width` pixels in on the x-axis and `0.215 * image_height` pixels down on the y-axis.
To convert absolute coordinates to normalized, you can divide x-axis values by the image width and y-axis values by the image height.
\[ x_{\text{normalized}} = \frac{x_{\text{absolute}}}{\text{image\_width}} \quad y_{\text{normalized}} = \frac{y_{\text{absolute}}}{\text{image\_height}} \]
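Here's a minimal sketch of this conversion in plain Python, using the absolute `XYXY` example from Table 1 and our `960x1280` image size (the helper function names are my own, not part of any library):

# Convert an absolute XYXY box to normalized XYXY and back (pure Python)
def xyxy_absolute_to_normalized(box, image_width, image_height):
    x_min, y_min, x_max, y_max = box
    return [x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height]

def xyxy_normalized_to_absolute(box, image_width, image_height):
    x_min, y_min, x_max, y_max = box
    return [x_min * image_width, y_min * image_height,
            x_max * image_width, y_max * image_height]

absolute_box = [8.9, 275.3, 867.5, 964.0]  # absolute XYXY example from Table 1
image_width, image_height = 960, 1280

normalized_box = xyxy_absolute_to_normalized(absolute_box, image_width, image_height)
print([round(value, 3) for value in normalized_box])  # ~[0.009, 0.215, 0.904, 0.753]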
5.2 Which bounding box format should you use?
The bounding box format you use will depend on the framework, model and existing data you’re trying to use.
For example, take the following frameworks:

- PyTorch - If you're using PyTorch pre-trained models, for example, `torchvision.models.detection.fasterrcnn_resnet50_fpn`, you'll want absolute `XYXY` (`[x1, y1, x2, y2]`) format.
- Hugging Face Transformers - If you're using a Hugging Face Transformers model such as Conditional DETR, you'll want to take note that outputs from the model can be of one type (e.g. `CXCYWH`) but they can be post-processed into another type (e.g. absolute `XYXY`).
- Ultralytics YOLO - If you're using a YOLO-like model such as Ultralytics YOLO, you'll want normalized `CXCYWH` (`[center_x, center_y, width, height]`) format.
- Google Gemini - If you're using Google Gemini to predict bounding boxes on your images, you'll want to pay attention to the special `[y_min, x_min, y_max, x_max]` (`YXYX`) normalized coordinates.

Or if you note that someone has said their model is pre-trained on the COCO dataset, chances are the data has been formatted in `XYWH` format (see Table 1).
For more on different bounding box formats and how to draw them, see A Guide to Bounding Box Formats and How to Draw Them.
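As a practical note, `torchvision.ops.box_convert` (which we used when plotting earlier) can translate between the three formats in Table 1. Here's a quick sketch using the absolute `XYWH` example from the table (the values are illustrative; this works on absolute coordinates, normalizing is a separate step):

import torch
from torchvision.ops import box_convert

box_xywh = torch.tensor([[8.9, 275.3, 858.6, 688.7]])  # absolute XYWH example from Table 1

box_xyxy = box_convert(box_xywh, in_fmt="xywh", out_fmt="xyxy")
box_cxcywh = box_convert(box_xywh, in_fmt="xywh", out_fmt="cxcywh")

print(f"XYXY:   {box_xyxy}")    # ~[[8.9, 275.3, 867.5, 964.0]]
print(f"CXCYWH: {box_cxcywh}")  # ~[[438.2, 619.65, 858.6, 688.7]]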
# TK - should I functionize the plotting of boxes and image so we can do input/output with tensors + data augmentations on that (E.g. original: image, augmented: image),
# - is this needed?
6 Getting an object detection model
There are two main ways of getting an object detection model:
- Building it yourself. For example, constructing it layer by layer, testing it and training it on your target problem.
- Using an existing one. For example, finding an existing model trained on a problem space similar to your own and then adapting it via transfer learning (TK - add link to glossary) to your own task.
In our case, we’re going to focus on the latter.
We’ll be taking a pre-trained object detection model and fine-tuning it on our Trashify 🚮 dataset so it outputs the boxes and labels we’re after.
6.1 Places to get object detection models
Instead of building your own machine learning model from scratch, it's common practice to take an existing model that works on a similar problem space to yours and then fine-tune (TK - add link to glossary) it to your own use case.
There are several places to get object detection models:
Location | Description |
---|---|
Hugging Face Hub | One of the best places on the internet to find open-source machine learning models of nearly any kind. You can find pre-trained object detection models here such as `facebook/detr-resnet-50`, a model from Facebook (Meta) and `microsoft/conditional-detr-resnet-50`, a model from Microsoft and the model we're going to use as our base model. Many of the models are permissively licensed, meaning you can use them for your own projects. |
`torchvision` | PyTorch's built-in domain library for computer vision has several pre-trained object detection models which you can use in your own workflows. |
paperswithcode.com/task/object-detection | Whilst not a direct place to download object detection models from, paperswithcode contains benchmarks for many machine learning tasks (including object detection) which shows the current state of the art (best performing) models and usually includes links to where to get the code. |
Detectron2 | Detectron2 is an open-source library to help with many of the tasks in detecting items in images. Inside you’ll find several pre-trained and adaptable models as well as utilities such as data loaders for object detection and segmentation tasks. |
YOLO Series | A running series of “You Only Look Once” models. Usually, the higher the number, the better performing. For example, YOLOv11 by Ultralytics should outperform YOLOv10 , however, this often requires testing on your own dataset. Beware of the license, it is under the AGPL-3.0 license which may cause issues in some organizations. |
`mmdetection` library | An open-source library from OpenMMLab which contains many different open-source models as well as detection-specific utilities. |
When you find a pre-trained object detection model, you’ll often see statements such as:
Conditional DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images).
Source: https://huggingface.co/microsoft/conditional-detr-resnet-50
This means the model has already been trained on the COCO object detection dataset which contains 118,000 images and 80 classes such as ["cake", "person", "skateboard"...]
.
This is a good thing.
It means that the model should have a fairly good starting point when we try to adapt it to our own project.
6.2 Downloading our model from Hugging Face
For our Trashify 🚮 project, we're going to be using the pre-trained object detection model `microsoft/conditional-detr-resnet-50`, which was originally introduced in the paper Conditional DETR for Fast Training Convergence.
The term "DETR" stands for "DEtection TRansformer".
Where "Transformer" refers to the Transformer neural network architecture, specifically the Vision Transformer (or ViT), rather than the Hugging Face `transformers` library (quite confusing, yes).
So DETR means "performing detection with the Transformer architecture".
And the "ResNet" part stands for "Residual Neural Network", a common computer vision backbone. The "50" refers to the number of layers in the network, so "ResNet-50" means the 50-layer version of ResNet. ResNet-101 and ResNet-18 are larger and smaller variants respectively.
To use this model, there are some helpful documentation resources we should be aware of:
Resource | Description |
---|---|
Conditional DETR documentation | Contains detailed information on each of the `transformers.ConditionalDetr` classes. |
`transformers.ConditionalDetrConfig` | Contains the configuration settings for our model such as number of layers and other hyperparameters. |
`transformers.ConditionalDetrImageProcessor` | Contains several preprocessing and post-processing functions and settings for data going into and out of our model. Here we can set values such as `size` in the `preprocess` method which will resize our images to a certain size. We can also use the `post_process_object_detection` method to process the raw outputs of our model into a more usable format. |
`transformers.ConditionalDetrForObjectDetection` | This will enable us to load the Conditional DETR model weights and pass data through them via the `forward` method. |
`transformers.AutoImageProcessor` | This will enable us to create an instance of `transformers.ConditionalDetrImageProcessor` by passing the model name `microsoft/conditional-detr-resnet-50` to the `from_pretrained` method. Hugging Face Transformers uses several Auto Classes for various problem spaces and models. |
`transformers.AutoModelForObjectDetection` | Enables us to load the model architecture and weights for the Conditional DETR architecture by passing the model name `microsoft/conditional-detr-resnet-50` to the `from_pretrained` method. |
We'll get hands-on with each of these throughout the project.
For now, if you’d like to read up more on each, I’d highly recommend it.
Knowing how to navigate and read through a framework’s documentation is a very helpful skill to have.
There are other object detection models we could try on the Hugging Face Hub, such as `facebook/detr-resnet-50` or `IDEA-Research/dab-detr-resnet-50-dc5-pat3`.
For now, we'll stick with `microsoft/conditional-detr-resnet-50`.
It’s easy to get stuck figuring out which model to use instead of just trying one and seeing how it goes.
Best to get something small working with one model and try another one later as part of a series of experiments to try and improve your results.
We can load our model with `transformers.AutoModelForObjectDetection.from_pretrained` and pass in the following parameters:

- `pretrained_model_name_or_path` - Our target model, which can be a local path or Hugging Face model name (e.g. `microsoft/conditional-detr-resnet-50`).
- `label2id` - A dictionary mapping our class names/labels to their numerical IDs, this is so our model will know how many classes to output.
- `id2label` - A dictionary mapping numerical IDs to our class names/labels, so our model will know how many classes we're working with and what their IDs are.
- `ignore_mismatched_sizes=True` (default) - We'll set this to `True` so that our model can be instantiated with a varying number of classes compared to what it may have been trained on (e.g. if our model was trained on the 91 classes from COCO, we only need 7).
- `backbone="resnet50"` (default) - We'll tell our model what kind of computer vision backbone to use for extracting features from our images.
See the full documentation for a full list of parameters we can use.
Let’s create a model!
from transformers import AutoModelForObjectDetection

MODEL_NAME = "microsoft/conditional-detr-resnet-50"

model = AutoModelForObjectDetection.from_pretrained(
    pretrained_model_name_or_path=MODEL_NAME,
    label2id=label2id,
    id2label=id2label,
    # Original model was trained with a different number of output classes to ours
    # So we'll ignore any mismatched sizes (e.g. 91 vs. 7)
    # Try turning this to False and see what happens
    ignore_mismatched_sizes=True,
    backbone="resnet50"
)

# Uncomment to see full model architecture
# model
Some weights of ConditionalDetrForObjectDetection were not initialized from the model checkpoint at microsoft/conditional-detr-resnet-50 and are newly initialized because the shapes did not match:
- class_labels_classifier.bias: found shape torch.Size([91]) in the checkpoint and torch.Size([7]) in the model instantiated
- class_labels_classifier.weight: found shape torch.Size([91, 256]) in the checkpoint and torch.Size([7, 256]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Beautiful!
We’ve got a model ready.
You might’ve noticed a warning about the model needing to be trained on a down-stream task:
Some weights of ConditionalDetrForObjectDetection were not initialized from the model checkpoint at microsoft/conditional-detr-resnet-50 and are newly initialized because the shapes did not match: - class_labels_classifier.bias: found shape torch.Size([91]) in the checkpoint and torch.Size([7]) in the model instantiated - class_labels_classifier.weight: found shape torch.Size([91, 256]) in the checkpoint and torch.Size([7, 256]) in the model instantiated You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
This is because our model has a different number of target classes (7 in total) compared to the original model (91 in total, from the COCO dataset).
So in order to get this pretrained model to work on our dataset, we’ll need to fine-tune it.
You might also notice that if you set `ignore_mismatched_sizes=False`, you'll get an error:

RuntimeError: Error(s) in loading state_dict for ConditionalDetrForObjectDetection: size mismatch for class_labels_classifier.weight: copying a param with shape torch.Size([91, 256]) from checkpoint, the shape in current model is torch.Size([7, 256]). size mismatch for class_labels_classifier.bias: copying a param with shape torch.Size([91]) from checkpoint, the shape in current model is torch.Size([7]). You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
This is a similar warning to the one above.
Keep this in mind for when you're working with pretrained models.
If you are using data slightly different to what the model was trained on, you may need to alter the setup hyperparameters as well as fine-tune it on your own data.
6.3 Inspecting our model’s layers
We can inspect the full model architecture by running `print(model)` (I've commented this out for brevity).
If you do so, you'll see a large list of layers which combine to make the overall model.
The following subset of layers has been truncated for brevity.
# Shortened version of the model architecture, print the full model to see all layers
ConditionalDetrForObjectDetection(
  (model): ConditionalDetrModel(
    (backbone): ConditionalDetrConvModel(
      (conv_encoder): ConditionalDetrConvEncoder(
        (model): FeatureListNet(
          (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
          (bn1): ConditionalDetrFrozenBatchNorm2d()
          ...
          (layer1): Sequential(
            (0): Bottleneck(
              (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (bn1): ConditionalDetrFrozenBatchNorm2d())))
          ...
      (position_embedding): ConditionalDetrSinePositionEmbedding()
    )
    (input_projection): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
    (query_position_embeddings): Embedding(300, 256)
    (encoder): ConditionalDetrEncoder(
      (layers): ModuleList(
        (0-5): 6 x ConditionalDetrEncoderLayer(
          (self_attn): DetrAttention(
          ...
          (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True))))
    (decoder): ConditionalDetrDecoder(
      (layers): ModuleList(
        (0): ConditionalDetrDecoderLayer(...)
      (layernorm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (query_scale): MLP(
        (layers): ModuleList(
          (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)))
      (ref_point_head): MLP(
      ...
      ))))
  (class_labels_classifier): Linear(in_features=256, out_features=7, bias=True)
  (bbox_predictor): ConditionalDetrMLPPredictionHead(
    (layers): ModuleList(
      (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
      (2): Linear(in_features=256, out_features=4, bias=True)))))
If we check out a few of our model’s layers, we can see that it is a combination of convolutional, attention, MLP (multi-layer perceptron) and linear layers.
I’ll leave exploring each of these layer types for extra-curriculum.
For now, think of them as progressive pattern extractors.
We'll feed our input image into our model and layer by layer it will manipulate the pixel values to try and extract patterns in a way so that its internal parameters match the image to its input annotations.
More specifically, if we dive into the final two layer sections:

- `class_labels_classifier` = classification head with `out_features=7` (one for each of our labels, `['bin', 'hand', 'not_bin', 'not_hand', 'not_trash', 'trash', 'trash_arm']`).
- `bbox_predictor` = regression head with `out_features=4` (one for each of our bbox coordinates, e.g. `[center_x, center_y, width, height]`).
print(f"[INFO] Final classification layer: {model.class_labels_classifier}\n")
print(f"[INFO] Final box regression layer: {model.bbox_predictor}")
[INFO] Final classification layer: Linear(in_features=256, out_features=7, bias=True)
[INFO] Final box regression layer: ConditionalDetrMLPPredictionHead(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
These two layers are what will output the final predictions of our model in a structure similar to our annotations.
The `class_labels_classifier` will output the predicted class label of a given bounding box output from the `bbox_predictor`.
In essence, we are trying to get all of the pretrained patterns (also called parameters/weights & biases) of the previous layers to conform to the ideal outputs we’d like at the end.
6.4 Counting the number of parameters in our model
Parameters are individual values which contribute to a model’s final output.
Parameters are also referred to as weights and biases.
You can think of these individual weights as small pushes and pulls on the input data to get it to match the input annotations.
If our weights were perfect, we could input an image and always get back the correct bounding boxes and class labels.
It’s very unlikely to ever have perfect weights (unless your dataset is very small) but we can make them quite good (and useful).
When you have a good set of weights, this is known as a good representation.
Right now, our weights have been trained on COCO, a collection of 91 different common objects.
So they have a fairly good representation of detecting general common objects, however, we’d like to fine-tune these weights to detect our target objects.
Importantly, our model will not be starting from scratch when it begins to train.
It will instead take off from its existing knowledge of detecting common objects in images and try to adhere to our task.
When it comes to parameters and weights, generally, more is better.
Meaning the more parameters your model has, the better representation it can learn.
For example, ResNet50 (our computer vision backbone) has ~25 million parameters, about 100 MB in `float32` precision or 50 MB in `float16` precision.
Whereas a model such as Llama-3.1-405B has ~405 billion parameters, about 1.45 TB in `float32` precision or 740 GB in `float16` precision, about 16,000x more than ResNet50.
However, as we can see, having more parameters comes with the tradeoff of size and latency.
Each new parameter requires storage and adds an extra computation to your model.
In the case of Trashify, since we'd like our model to run on-device (e.g. make predictions live on an iPhone), we'd opt for the smallest number of parameters we could get acceptable results from.
If performance is your number 1 criteria and size and latency don't matter, then you'd likely opt for the model with the largest number of parameters (though always evaluate these models on your own data, larger models are generally better, but not always).
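As a rough back-of-the-envelope check on those size figures, you can estimate a model's footprint from its parameter count and the bytes per parameter (4 bytes for `float32`, 2 bytes for `float16`). The numbers are approximate, real checkpoints also store extra buffers and metadata, and decimal GB vs binary GiB shifts them slightly:

# Rough estimate of model size in GB from parameter count (approximation only)
def estimate_model_size_gb(num_parameters, bytes_per_parameter=4):
    """Returns an approximate model size in GB (4 bytes/param for float32, 2 for float16)."""
    return (num_parameters * bytes_per_parameter) / 1e9

print(f"ResNet50 (~25M params), float32: ~{estimate_model_size_gb(25e6, 4):.2f} GB")
print(f"ResNet50 (~25M params), float16: ~{estimate_model_size_gb(25e6, 2):.2f} GB")
print(f"Llama-3.1-405B (~405B params), float32: ~{estimate_model_size_gb(405e9, 4):,.0f} GB")
print(f"Llama-3.1-405B (~405B params), float16: ~{estimate_model_size_gb(405e9, 2):,.0f} GB")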
Since our model is built using PyTorch, let’s write a small function to count the number of:
- Trainable parameters (parameters which will be tweaked during training)
- Non-trainable parameters (parameters which will not be tweaked during training)
- Total parameters (trainable parameters + non-trainable parameters)
# Count the number of parameters in the model
def count_parameters(model):
    """Takes in a PyTorch model and returns the number of parameters."""
    trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    non_trainable_parameters = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    total_parameters = sum(p.numel() for p in model.parameters())
    print(f"Total parameters: {total_parameters:,}")
    print(f"Trainable parameters (will be updated): {trainable_parameters:,}")
    print(f"Non-trainable parameters (will not be updated): {non_trainable_parameters:,}")

count_parameters(model)
Total parameters: 43,396,813
Trainable parameters (will be updated): 43,174,413
Non-trainable parameters (will not be updated): 222,400
Cool!
It looks like our model has a total of `43,396,813` parameters, most of which are trainable.
This means that when we fine-tune our model later on, we’ll be tweaking the majority of the parameters to try and represent our data.
In practice, this is known as full fine-tuning, trying to fine-tune a large portion of the model to our data.
There are other methods for fine-tuning, such as feature extraction (where you only fine-tune the final layers of the model) and partial fine-tuning (where you fine-tune a portion of the model).
And even methods such as LoRA (Low-Rank Adaptation) which fine-tunes an adapter matrix as a complement to the model's parameters.
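As a small aside, here's a minimal sketch of what partial fine-tuning/feature extraction could look like for our model: freezing the convolutional backbone so only the transformer layers and prediction heads get updated. The attribute path (`model.model.backbone`) follows the architecture printout above, so treat it as an assumption and verify it on your own model (we create a fresh model instance in the next section, so running this won't affect our later training):

# Freeze the convolutional backbone (partial fine-tuning sketch)
# Note: model.model.backbone is assumed from the printed architecture above
for parameter in model.model.backbone.parameters():
    parameter.requires_grad = False

# Re-count the parameters, the non-trainable count should now be much higher
count_parameters(model)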
6.5 Creating a function to build our model
Since machine learning is very experimental, we may want to create multiple instances of our `model` to test various things.
So let's functionize the creation of a new model with parameters for our target model name, `id2label` and `label2id` dictionaries.
# Setup the model
def create_model(pretrained_model_name_or_path: str = MODEL_NAME,
                 label2id: dict = label2id,
                 id2label: dict = id2label):
    """Creates and returns an instance of AutoModelForObjectDetection.

    Args:
        pretrained_model_name_or_path (str): The name or path of the pretrained model to load.
            Defaults to MODEL_NAME.
        label2id (dict): A dictionary mapping class labels to IDs. Defaults to label2id.
        id2label (dict): A dictionary mapping class IDs to labels. Defaults to id2label.

    Returns:
        AutoModelForObjectDetection: A pretrained model for object detection with number of output
        classes equivalent to len(label2id).
    """
    model = AutoModelForObjectDetection.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True, # default
        backbone="resnet50", # default
    )
    return model
Perfect!
And to make sure our function works…
# Create a new model instance
model = create_model()
# model
Some weights of ConditionalDetrForObjectDetection were not initialized from the model checkpoint at microsoft/conditional-detr-resnet-50 and are newly initialized because the shapes did not match:
- class_labels_classifier.bias: found shape torch.Size([91]) in the checkpoint and torch.Size([7]) in the model instantiated
- class_labels_classifier.weight: found shape torch.Size([91, 256]) in the checkpoint and torch.Size([7, 256]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
6.6 Trying to pass a single sample through our model (part 1)
Okay, now we’ve got a model, let’s put some data through it!
When we call our `model`, because it's a PyTorch module (`torch.nn.Module`), it will by default run the `forward` method.
In PyTorch, `torch.nn.Module` defines the special `__call__` method to dispatch to `forward` (so calling the module runs `forward`, plus any registered hooks).
So we can pass data into our model by running:

model(input_data)

Which is equivalent to running:

model.forward(input_data)

To see what happens when we call our model, let's inspect the `forward` method's docstring with `model.forward?`.
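If the `__call__`/`forward` relationship is new to you, here's a tiny demonstration with a stand-alone `torch.nn.Linear` layer (not our detection model) showing that calling a module and calling its `forward` method produce the same result:

import torch
from torch import nn

layer = nn.Linear(in_features=4, out_features=2)
x = torch.randn(1, 4)

out_call = layer(x)            # preferred: __call__ runs forward (plus any registered hooks)
out_forward = layer.forward(x) # same computation, but bypasses the hook machinery

print(torch.allclose(out_call, out_forward))  # True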
model
ConditionalDetrForObjectDetection(
(model): ConditionalDetrModel(
(backbone): ConditionalDetrConvModel(
(conv_encoder): ConditionalDetrConvEncoder(
(model): FeatureListNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
)
)
(position_embedding): ConditionalDetrSinePositionEmbedding()
)
(input_projection): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(query_position_embeddings): Embedding(300, 256)
(encoder): ConditionalDetrEncoder(
(layers): ModuleList(
(0-5): 6 x ConditionalDetrEncoderLayer(
(self_attn): DetrAttention(
(k_proj): Linear(in_features=256, out_features=256, bias=True)
(v_proj): Linear(in_features=256, out_features=256, bias=True)
(q_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(activation_fn): ReLU()
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): ConditionalDetrDecoder(
(layers): ModuleList(
(0): ConditionalDetrDecoderLayer(
(sa_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_v_proj): Linear(in_features=256, out_features=256, bias=True)
(self_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(activation_fn): ReLU()
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_v_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_sine_proj): Linear(in_features=256, out_features=256, bias=True)
(encoder_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1-5): 5 x ConditionalDetrDecoderLayer(
(sa_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_v_proj): Linear(in_features=256, out_features=256, bias=True)
(self_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(activation_fn): ReLU()
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_proj): None
(ca_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_v_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_sine_proj): Linear(in_features=256, out_features=256, bias=True)
(encoder_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(layernorm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(query_scale): MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
)
)
(ref_point_head): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=2, bias=True)
)
)
)
)
(class_labels_classifier): Linear(in_features=256, out_features=7, bias=True)
(bbox_predictor): ConditionalDetrMLPPredictionHead(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
# What happens when we call our model?
# Note: for PyTorch modules, __call__ wraps forward,
# so calling the model is equivalent to calling the forward method.
model.forward?
Output of model.forward?
Signature:
model.forward(
pixel_values: torch.FloatTensor,
pixel_mask: Optional[torch.LongTensor] = None,
decoder_attention_mask: Optional[torch.LongTensor] = None,
encoder_outputs: Optional[torch.FloatTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[List[dict]] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple[torch.FloatTensor], transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrObjectDetectionOutput]
Docstring:
The [`ConditionalDetrForObjectDetection`] forward method, overrides the `__call__` special method.
<Tip>
Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
</Tip>
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it.
Pixel values can be obtained using [`AutoImageProcessor`]. See [`ConditionalDetrImageProcessor.__call__`]
for details.
pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
- 1 for pixels that are real (i.e. **not masked**),
- 0 for pixels that are padding (i.e. **masked**).
[What are attention masks?](../glossary#attention-mask)
decoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, num_queries)`, *optional*):
Not used by default. Can be used to mask object queries.
encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`)
`last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) is a sequence of
hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing the flattened feature map (output of the backbone + projection layer), you
can choose to directly pass a flattened representation of an image.
decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`, *optional*):
Optionally, instead of initializing the queries with a tensor of zeros, you can choose to directly pass an
embedded representation.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
labels (`List[Dict]` of len `(batch_size,)`, *optional*):
Labels for computing the bipartite matching loss. List of dicts, each dictionary containing at least the
following 2 keys: 'class_labels' and 'boxes' (the class labels and bounding boxes of an image in the batch
respectively). The class labels themselves should be a `torch.LongTensor` of len `(number of bounding boxes
in the image,)` and the boxes a `torch.FloatTensor` of shape `(number of bounding boxes in the image, 4)`.
Returns:
[`transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrObjectDetectionOutput`] or `tuple(torch.FloatTensor)`: A [`transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrObjectDetectionOutput`] or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([`ConditionalDetrConfig`]) and inputs.
- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)) -- Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
scale-invariant IoU loss.
- **loss_dict** (`Dict`, *optional*) -- A dictionary containing the individual losses. Useful for logging.
- **logits** (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`) -- Classification logits (including no-object) for all queries.
- **pred_boxes** (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`) -- Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the
unnormalized bounding boxes.
- **auxiliary_outputs** (`list[Dict]`, *optional*) -- Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
`pred_boxes`) for each decoder layer.
- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- Sequence of hidden-states at the output of the last layer of the decoder of the model.
- **decoder_hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each
layer plus the initial embedding outputs.
- **decoder_attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
- **cross_attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
- **encoder_last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- Sequence of hidden-states at the output of the last layer of the encoder of the model.
- **encoder_hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
- **encoder_attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
Examples:
```python
>>> from transformers import AutoImageProcessor, AutoModelForObjectDetection
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
>>> model = AutoModelForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
... 0
... ]
>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
... box = [round(i, 2) for i in box.tolist()]
... print(
... f"Detected {model.config.id2label[label.item()]} with confidence "
... f"{round(score.item(), 3)} at location {box}"
... )
Detected remote with confidence 0.833 at location [38.31, 72.1, 177.63, 118.45]
Detected cat with confidence 0.831 at location [9.2, 51.38, 321.13, 469.0]
Detected cat with confidence 0.804 at location [340.3, 16.85, 642.93, 370.95]
Detected remote with confidence 0.683 at location [334.48, 73.49, 366.37, 190.01]
Detected couch with confidence 0.535 at location [0.52, 1.19, 640.35, 475.1]
```
File: ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/conditional_detr/modeling_conditional_detr.py
Type: method
Running model.forward?
we can see that our model wants to take in pixel_values
as well as a pixel_mask
as arguments.
What happens if we try to pass in a single image from our random_sample
?
Let’s try!
It’s good practice to try and pass a single sample through your model as soon as possible to see what happens.
If we’re lucky, it’ll work.
If we’re really lucky, we’ll get an error message saying why it didn’t work (this is usually the case because rarely does raw data flow through a model without being preprocessed first).
We’ll do so by setting pixel_values
to our random_sample["image"]
and pixel_mask=None
.
# Do a single forward pass with the model
random_sample_outputs = model(pixel_values=random_sample["image"],
                              pixel_mask=None)
random_sample_outputs
Output of random_sample_outputs
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[34], line 2
1 # Do a single forward pass with the model
----> 2 random_sample_outputs = model(pixel_values=random_sample["image"],
3 pixel_mask=None)
4 random_sample_outputs
File ~/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py:1739, in Module._wrapped_call_impl(self, *args, **kwargs)
1737 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1738 else:
-> 1739 return self._call_impl(*args, **kwargs)
File ~/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py:1750, in Module._call_impl(self, *args, **kwargs)
1745 # If we don't have any hooks, we want to skip the rest of the logic in
1746 # this function, and just call forward.
1747 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1748 or _global_backward_pre_hooks or _global_backward_hooks
1749 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750 return forward_call(*args, **kwargs)
1752 result = None
1753 called_always_called_hooks = set()
File ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/conditional_detr/modeling_conditional_detr.py:1717, in ConditionalDetrForObjectDetection.forward(self, pixel_values, pixel_mask, decoder_attention_mask, encoder_outputs, inputs_embeds, decoder_inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1714 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1716 # First, sent images through CONDITIONAL_DETR base model to obtain encoder + decoder outputs
-> 1717 outputs = self.model(
1718 pixel_values,
1719 pixel_mask=pixel_mask,
1720 decoder_attention_mask=decoder_attention_mask,
1721 encoder_outputs=encoder_outputs,
1722 inputs_embeds=inputs_embeds,
1723 decoder_inputs_embeds=decoder_inputs_embeds,
1724 output_attentions=output_attentions,
1725 output_hidden_states=output_hidden_states,
1726 return_dict=return_dict,
1727 )
1729 sequence_output = outputs[0]
1731 # class logits + predicted bounding boxes
File ~/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py:1739, in Module._wrapped_call_impl(self, *args, **kwargs)
1737 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1738 else:
-> 1739 return self._call_impl(*args, **kwargs)
File ~/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py:1750, in Module._call_impl(self, *args, **kwargs)
1745 # If we don't have any hooks, we want to skip the rest of the logic in
1746 # this function, and just call forward.
1747 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1748 or _global_backward_pre_hooks or _global_backward_hooks
1749 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1750 return forward_call(*args, **kwargs)
1752 result = None
1753 called_always_called_hooks = set()
File ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/conditional_detr/modeling_conditional_detr.py:1521, in ConditionalDetrModel.forward(self, pixel_values, pixel_mask, decoder_attention_mask, encoder_outputs, inputs_embeds, decoder_inputs_embeds, output_attentions, output_hidden_states, return_dict)
1516 output_hidden_states = (
1517 output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1518 )
1519 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-> 1521 batch_size, num_channels, height, width = pixel_values.shape
1522 device = pixel_values.device
1524 if pixel_mask is None:
AttributeError: 'Image' object has no attribute 'shape'
Oh no!… I mean… Oh, yes!
We get an error:
AttributeError: ‘Image’ object has no attribute ‘shape’
Hmmm… it seems we’ve tried to pass a PIL.Image
to our model rather than a torch.FloatTensor
of shape (batch_size, num_channels, height, width)
.
It looks like our input data might require some preprocessing before we can pass it to our model.
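Before we get to the Hugging Face way of doing this, here’s a quick sketch of why the error happens: a PIL image isn’t a tensor, so it has no shape attribute. Manually converting it with torchvision (just an illustration, not the approach we’ll use) gives us something tensor-shaped, though it would still be missing the resizing and normalization our model expects.
# A quick sketch (not the approach we'll use): manually turn the PIL image into a tensor
from torchvision.transforms.functional import pil_to_tensor

image_tensor = pil_to_tensor(random_sample["image"]) # -> [colour_channels, height, width]
print(image_tensor.shape)                            # e.g. torch.Size([3, 1280, 960])
print(image_tensor.unsqueeze(0).shape)               # add a batch dimension -> e.g. [1, 3, 1280, 960]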
7 Aside: Processor to Model Pattern
Many Hugging Face data loading and modelling workflows as well as machine learning workflows in general follow the pattern of:
- Data -> Preprocessor -> Model
TK image - can we make data -> preprocessor -> model look better? potentially a flow chart?
Meaning, the raw input data gets preprocessed or transformed in some way before being passed to a model.
Preprocessors and models are often loaded with an Auto Class.
An Auto Class pairs a preprocessor and model based on their model name or key.
For example:
from transformers import AutoProcessor, AutoModel

# Load raw data
raw_data = load_data()

# Define target model name
MODEL_NAME = "..."

# Load preprocessor and model (these two are often paired)
preprocessor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Preprocess data
preprocessed_data = preprocessor.preprocess(raw_data)

# Pass preprocessed data to model
output = model(preprocessed_data)
This is the same for our Trashify 🚮 project.
We’ve got our raw data (images and bounding boxes), however, they need to be preprocessed in order for our model to be able to handle them.
Previously we tried to pass a sample of raw data to our model and this errored.
We can fix this by first preprocessing our raw data with our model’s paired preprocessor and then passing it to our model again.
8 Loading our model’s processor
Time to get our raw data ready for our model!
To begin, let’s load our model’s processor.
We’ll use this to prepare our input images for the model.
To do so, we’ll use transformers.AutoImageProcessor
and pass our target model name to the from_pretrained
method.
from transformers import AutoImageProcessor

MODEL_NAME = "microsoft/conditional-detr-resnet-50"
# MODEL_NAME = "facebook/detr-resnet-50" # Could also use this model as another experiment

# Load the image processor
image_processor = AutoImageProcessor.from_pretrained(pretrained_model_name_or_path=MODEL_NAME)

# Check out the image processor
image_processor
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
ConditionalDetrImageProcessor {
"do_convert_annotations": true,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"format": "coco_detection",
"image_mean": [
0.485,
0.456,
0.406
],
"image_processor_type": "ConditionalDetrImageProcessor",
"image_std": [
0.229,
0.224,
0.225
],
"pad_size": null,
"resample": 2,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 1333,
"shortest_edge": 800
}
}
Ok, a few things going on here.
These parameters will transform our input images before we pass them to our model.
One of the first things to see is the image_processor
is expecting our bounding boxes to be in COCO (or coco_detection
) format (this is the default).
We’ll see what this looks like later on. Our processor wants this format because that’s the format our model was trained on (it’s generally best practice to feed data to a model in the same format it was trained on, otherwise you might get poor results).
Another thing to notice is that our input images will be resized to the values of the size
parameter.
In our case, it’s currently:
"size": {
"longest_edge": 1333,
"shortest_edge": 800
}
This means the shortest edge will be resized to at most 800 pixels and the longest edge to at most 1333 pixels (while preserving the aspect ratio).
For simplicity, we’ll change this shortly to make both sides the same size.
You can read more about what each of these does in the transformers.ConditionalDetrImageProcessor
documentation.
Let’s update our instance of transformers.ConditionalDetrImageProcessor
with a few custom parameters:
- do_convert_annotations=True - This is the default and it will convert our boxes to the format CXCYWH or (center_x, center_y, width, height) (see Table 1) in the range [0, 1].
- size - We’ll update the size dictionary so all of our images have "longest_edge": 640 and "shortest_edge": 640. We’ll use a value of 640 which is a common size in the world of object detection. But there are also other sizes such as 300x300, 480x480, 512x512, 800x800 and more.
Depending on what task you’re working on, you might want to tweak the image resolution you’re working with.
For example, I like this quote from Lucas Beyer, a former research scientist at DeepMind and engineer at OpenAI:
My conservative claim is that you can always stretch to a square, and for:
natural images, meaning most photos, 224px² is enough; text in photos, phone screens, diagrams and charts, 448px² is enough; desktop screens and single-page documents, 896px² is enough.
Typically, in the case of object detection, you’ll want to use a higher value.
But this is another thing that is open to experimentation.
# Set image size
IMAGE_SIZE = 640 # we could try other sizes here: 300x300, 480x480, 512x512, 640x640, 800x800 (best to experiment and see which works best)

# Create a new instance of the image processor with the desired image size
image_processor = AutoImageProcessor.from_pretrained(
    pretrained_model_name_or_path=MODEL_NAME,
    format="coco_detection", # this is the default
    do_convert_annotations=True, # defaults to True, converts boxes to (center_x, center_y, width, height) in range [0, 1]
    size={"shortest_edge": IMAGE_SIZE,
          "longest_edge": IMAGE_SIZE}
)
# Optional: View the docstring of our image_processor.preprocess function
# image_processor.preprocess?
# Check out our new image processor size
image_processor.size
{'shortest_edge': 640, 'longest_edge': 640}
Beautiful!
Now our images will be resized to a square of size 640x640
when we pass them to our model.
How about we try to preprocess our random_sample
?
To do so, we can pass its "image"
key and "annotations"
key to our image_processor
’s preprocess
method.
Let’s try!
# Try to process a single image and annotation pair (spoiler: this will error)
random_sample_preprocessed = image_processor.preprocess(images=random_sample["image"],
                                                        annotations=random_sample["annotations"])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[23], line 2 1 # Try to process a single image and annotation pair (spoiler: this will error) ----> 2 random_sample_preprocessed = image_processor.preprocess(images=random_sample["image"], 3 annotations=random_sample["annotations"]) File ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/conditional_detr/image_processing_conditional_detr.py:1422, in ConditionalDetrImageProcessor.preprocess(self, images, annotations, return_segmentation_masks, masks_path, do_resize, size, resample, do_rescale, rescale_factor, do_normalize, do_convert_annotations, image_mean, image_std, do_pad, format, return_tensors, data_format, input_data_format, pad_size, **kwargs) 1420 format = AnnotationFormat(format) 1421 if annotations is not None: -> 1422 validate_annotations(format, SUPPORTED_ANNOTATION_FORMATS, annotations) 1424 if ( 1425 masks_path is not None 1426 and format == AnnotationFormat.COCO_PANOPTIC 1427 and not isinstance(masks_path, (pathlib.Path, str)) 1428 ): 1429 raise ValueError( 1430 "The path to the directory containing the mask PNG files should be provided as a" 1431 f" `pathlib.Path` or string object, but is {type(masks_path)} instead." 1432 ) File ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/image_utils.py:851, in validate_annotations(annotation_format, supported_annotation_formats, annotations) 849 if annotation_format is AnnotationFormat.COCO_DETECTION: 850 if not valid_coco_detection_annotations(annotations): --> 851 raise ValueError( 852 "Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts " 853 "(batch of images) with the following keys: `image_id` and `annotations`, with the latter " 854 "being a list of annotations in the COCO format." 855 ) 857 if annotation_format is AnnotationFormat.COCO_PANOPTIC: 858 if not valid_coco_panoptic_annotations(annotations): ValueError: Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts (batch of images) with the following keys: `image_id` and `annotations`, with the latter being a list of annotations in the COCO format.
Oh no!
We get an error:
ValueError: Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts (batch of images) with the following keys:
image_id
andannotations
, with the latter being a list of annotations in the COCO format.
8.1 Preprocessing a single image
Okay so it turns out that our annotations aren’t in the format that the preprocess
method was expecting.
Since our pre-trained model was trained on the COCO dataset, the preprocess
method expects input data to be in line with the COCO format.
We can fix this later on by adjusting our annotations.
How about we try to preprocess just a single image instead?
# Preprocess our target sample
random_sample_preprocessed_image_only = image_processor.preprocess(images=random_sample["image"],
                                                                   annotations=None, # no annotations this time
                                                                   return_tensors="pt") # return as PyTorch tensors

# Uncomment to see the full output
# print(random_sample_preprocessed_image_only)

# Print out the keys of the preprocessed image
print(random_sample_preprocessed_image_only.keys())
dict_keys(['pixel_values', 'pixel_mask'])
Nice! It looks like the preprocess
method works on a single image.
And it seems like we get a dictionary output with the following keys:
- pixel_values - the processed pixel values of the input image.
- pixel_mask - a mask multiplier for the pixel values as to whether they should be paid attention to or not (a value of 0 means the pixel value should be ignored by the model and a value of 1 means the pixel value should be paid attention to by the model).
In our case, all values of the pixel_mask
are 1
since we’re not using any masks.
Let’s check.
PS Do you remember where we needed these keys? pixel_values
and pixel_mask
? Hint: it’s the reverse of drawrof.ledom
.
# Check all values of the pixel_mask are 1
torch.all(random_sample_preprocessed_image_only["pixel_mask"][0]) == 1
tensor(True)
Beautiful!
Now how about we inspect our processed image’s shape?
# Uncomment to inspect all preprocessed pixel values
# print(random_sample_preprocessed_image_only["pixel_values"][0])
print(f"[INFO] Original image shape: {random_sample['image'].size} -> [width, height]")
print(f"[INFO] Preprocessed image shape: {random_sample_preprocessed_image_only['pixel_values'].shape} -> [batch_size, colour_channles, height, width]")
[INFO] Original image shape: (960, 1280) -> [width, height]
[INFO] Preprocessed image shape: torch.Size([1, 3, 640, 480]) -> [batch_size, colour_channles, height, width]
Ok wonderful, it looks like our image has been downsized to [1, 3, 640, 480] (1 item in the batch, 3 colour channels, 640 pixels high, 480 pixels wide).
This is down from its original PIL size of (960, 1280) (960 pixels wide, 1280 pixels high).
The order of image dimensions can differ between libraries and frameworks.
For example, image tensors in PyTorch typically follow the format [colour_channels, height, width]
whereas in TensorFlow they follow [height, width, colour_channels]
.
And in PIL, the format is [width, height]
.
As you can imagine, this can get confusing.
However, with some practice, you’ll be able to decipher which is which.
And if your images and bounding boxes start looking strange, perhaps checking the image dimension and format can help.
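As a quick reference, here’s a small sketch of moving between these layouts (the tensor here is random data, just to show the shapes):
# A small sketch of common image layout conventions (random data, just for the shapes)
import torch
from PIL import Image

chw = torch.rand(3, 640, 480)      # PyTorch-style [colour_channels, height, width]
hwc = chw.permute(1, 2, 0)         # TensorFlow-style [height, width, colour_channels]
print(chw.shape, hwc.shape)        # torch.Size([3, 640, 480]) torch.Size([640, 480, 3])

pil_image = Image.new(mode="RGB", size=(480, 640)) # PIL uses (width, height)
print(pil_image.size)              # (480, 640)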
8.2 Trying to pass a single sample through our model (part 2)
This is exciting!
We’ve processed an image into the format our model is expecting.
How about we try another forward pass by calling model.forward(pixel_values, pixel_mask)?
Which is the same as calling model(pixel_values, pixel_mask).
# Do a single forward pass with the model
random_sample_outputs = model(
    pixel_values=random_sample_preprocessed_image_only["pixel_values"], # model expects input [batch_size, colour_channels, height, width]
    pixel_mask=random_sample_preprocessed_image_only["pixel_mask"],
)

# Inspect the outputs
random_sample_outputs
ConditionalDetrObjectDetectionOutput(loss=None, loss_dict=None, logits=tensor([[[ 0.2620, -0.2777, 0.0839, ..., 0.1108, 0.1524, -0.0453],
[-0.0925, -0.0740, 0.2196, ..., 0.0273, 0.2322, 0.1993],
[ 0.2644, -0.3717, 0.1711, ..., -0.1531, 0.2759, 0.2365],
...,
[ 0.2135, -0.1681, 0.1053, ..., 0.2136, 0.3571, 0.3218],
[ 0.3133, -0.3826, 0.1889, ..., -0.2053, 0.1716, 0.4319],
[ 0.1135, -0.3630, 0.1966, ..., 0.1255, 0.2667, 0.3352]]],
grad_fn=<ViewBackward0>), pred_boxes=tensor([[[0.8308, 0.7621, 0.3302, 0.4626],
[0.6322, 0.0778, 0.3559, 0.1557],
[0.9358, 0.6082, 0.1232, 0.2065],
...,
[0.3369, 0.3165, 0.0488, 0.0207],
[0.9289, 0.5757, 0.1353, 0.1871],
[0.0385, 0.2647, 0.0765, 0.0530]]], grad_fn=<SigmoidBackward0>), auxiliary_outputs=None, last_hidden_state=tensor([[[ 0.4178, -0.4003, 0.1623, ..., -0.8785, -0.0225, 0.3085],
[ 0.4407, 0.1585, 0.7427, ..., -0.4346, 0.1030, 0.1669],
[ 0.0848, 0.0257, -0.4328, ..., -1.3943, 0.0823, 0.1586],
...,
[ 0.4692, -0.0750, 0.3395, ..., -0.0394, -0.0082, 0.5997],
[ 0.0701, -0.1430, -0.0892, ..., -1.2188, 0.3301, 0.2280],
[ 0.1688, -0.3394, 0.2426, ..., -0.8164, 0.0485, 0.6763]]],
grad_fn=<NativeLayerNormBackward0>), decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[-0.3609, 0.3855, -0.3594, ..., -0.5318, 0.4577, 0.2945],
[ 0.0948, 0.3784, -0.0511, ..., 0.3246, 0.1879, 0.3580],
[ 0.1088, 0.5034, -0.0875, ..., -0.0581, 0.1715, 0.2579],
...,
[ 0.1681, 0.5618, 0.0051, ..., 0.3749, 0.1115, 0.2559],
[ 0.1149, 0.0688, 0.1017, ..., 0.7759, 0.0183, 0.1238],
[ 0.0566, 0.2223, 0.1506, ..., -0.1509, 0.0468, 0.0529]]],
grad_fn=<NativeLayerNormBackward0>), encoder_hidden_states=None, encoder_attentions=None)
Nice!
It looks like it worked!
Our model processed our random_sample_preprocessed_image_only["pixel_values"]
and returned a ConditionalDetrObjectDetectionOutput
object as output.
Let’s inspect the keys()
method of this output and see what they are.
# Check the keys of the output
random_sample_outputs.keys()
odict_keys(['logits', 'pred_boxes', 'last_hidden_state', 'encoder_last_hidden_state'])
Breaking these down:
- logits - The raw outputs from the model, these are the classification logits we can later apply a softmax/sigmoid function to in order to get prediction probabilities.
- pred_boxes - Normalized box coordinates in CXCYWH ((center_x, center_y, width, height)) format.
- last_hidden_state - Last hidden state of the last decoder layer of the model.
- encoder_last_hidden_state - Last hidden state of the last encoder layer of the model.
How about we inspect the shape
attribute of the logits
?
# Inspect logits output shape
output_logits = random_sample_outputs.logits
print(f"[INFO] Output logits shape: {output_logits.shape} -> [1 image, 300 boxes, 7 classes]")
[INFO] Output logits shape: torch.Size([1, 300, 7]) -> [1 image, 300 boxes, 7 classes]
Nice!
We get an output from our model that coincides with the shape of our data.
The final value of 7
in the output_logits
tensor is equivalent to the number of classes we have.
And the 300
is the number of boxes our model predicts for each image (this is defined by the num_queries
parameter of the transformers.ConditionalDetrConfig
, where num_queries=300
is the default).
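Out of curiosity, here’s a rough sketch of turning those raw logits into prediction probabilities ourselves (using a softmax across the class dimension; a sigmoid is the other common option mentioned above). This is just for intuition, since our model hasn’t been trained yet and we’ll use the image processor’s post-processing methods to do this properly later on.
# A rough sketch of turning raw logits into per-query class probabilities (for intuition only)
query_probs = output_logits.softmax(dim=-1)  # [1, 300, 7] -> probability over classes for each box
best_scores, best_labels = query_probs.max(dim=-1)

print(f"Highest scoring query: {best_scores.max().item():.3f}")
print(f"Predicted class ID for that query: {best_labels[0][best_scores.argmax()].item()}")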
# Inspect predicted boxes output shape
output_pred_boxes = random_sample_outputs.pred_boxes
print(f"[INFO] Output predicted boxes shape: {output_pred_boxes.shape} -> [1 image, 300 boxes, 4 coordinates (center_x, center_y, width, height)]")
[INFO] Output predicted boxes shape: torch.Size([1, 300, 4]) -> [1 image, 300 boxes, 4 coordinates (center_x, center_y, width, height)]
Reading the documentation for the forward
method, we can determine the output format of our model’s predicted boxes:
Returns:
pred_boxes (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding possible padding). You can use
post_process_object_detection()
to retrieve the unnormalized bounding boxes.
This is good to know!
It means that the raw output boxes from our model come in normalized CXCYWH
format (see Table 1 for more).
How about we inspect a single box?
# Single example predicted bounding box coordinates
print(f"[INFO] Example output box: {output_pred_boxes[:, 0, :][0].detach()} -> (center_x, center_y, width, height)")
[INFO] Example output box: tensor([0.8308, 0.7621, 0.3302, 0.4626]) -> (center_x, center_y, width, height)
Excellent!
We can process these boxes and logits later on into different formats using the transformers.ConditionalDetrImageProcessor.post_process_object_detection
method.
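As a preview, here’s a rough sketch of what converting one of those normalized CXCYWH boxes into absolute XYXY (top-left, bottom-right) coordinates looks like, assuming the 480x640 (width x height) preprocessed image from above. Later on we’ll let post_process_object_detection handle this for us.
# A rough sketch of converting a normalized CXCYWH box to absolute XYXY coordinates,
# assuming a 480 pixel wide x 640 pixel high image (our preprocessed sample's size)
center_x, center_y, box_width, box_height = [0.8308, 0.7621, 0.3302, 0.4626]
image_width, image_height = 480, 640

x_min = (center_x - box_width / 2) * image_width
y_min = (center_y - box_height / 2) * image_height
x_max = (center_x + box_width / 2) * image_width
y_max = (center_y + box_height / 2) * image_height

print([round(coord, 1) for coord in (x_min, y_min, x_max, y_max)]) # -> [319.5, 339.7, 478.0, 635.8]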
For now, let’s figure out how to preprocess our annotations.
9 TK Preprocessing our annotations
One of the most tricky parts of any machine learning problem is getting your data in the right format.
We’ve done it for our images.
Now let’s do it for our annotations.
9.1 Trying to preprocess a single annotation
Recall in a previous section we tried to preprocess a single image and its annotation.
And we got an error.
Let’s make sure we’re not crazy and this is still the case.
# Preprocess a single image and annotation pair
image_processor.preprocess(
    images=random_sample["image"],
    annotations=random_sample["annotations"]
)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[32], line 2 1 # Preprocess a single image and annotation pair ----> 2 image_processor.preprocess( 3 images=random_sample["image"], 4 annotations=random_sample["annotations"] 5 ) File ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/conditional_detr/image_processing_conditional_detr.py:1422, in ConditionalDetrImageProcessor.preprocess(self, images, annotations, return_segmentation_masks, masks_path, do_resize, size, resample, do_rescale, rescale_factor, do_normalize, do_convert_annotations, image_mean, image_std, do_pad, format, return_tensors, data_format, input_data_format, pad_size, **kwargs) 1420 format = AnnotationFormat(format) 1421 if annotations is not None: -> 1422 validate_annotations(format, SUPPORTED_ANNOTATION_FORMATS, annotations) 1424 if ( 1425 masks_path is not None 1426 and format == AnnotationFormat.COCO_PANOPTIC 1427 and not isinstance(masks_path, (pathlib.Path, str)) 1428 ): 1429 raise ValueError( 1430 "The path to the directory containing the mask PNG files should be provided as a" 1431 f" `pathlib.Path` or string object, but is {type(masks_path)} instead." 1432 ) File ~/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/image_utils.py:851, in validate_annotations(annotation_format, supported_annotation_formats, annotations) 849 if annotation_format is AnnotationFormat.COCO_DETECTION: 850 if not valid_coco_detection_annotations(annotations): --> 851 raise ValueError( 852 "Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts " 853 "(batch of images) with the following keys: `image_id` and `annotations`, with the latter " 854 "being a list of annotations in the COCO format." 855 ) 857 if annotation_format is AnnotationFormat.COCO_PANOPTIC: 858 if not valid_coco_panoptic_annotations(annotations): ValueError: Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts (batch of images) with the following keys: `image_id` and `annotations`, with the latter being a list of annotations in the COCO format.
Wonderful!
We’re not crazy…
But we still get an error:
ValueError: Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts (batch of images) with the following keys:
image_id
andannotations
, with the latter being a list of annotations in the COCO format.
In this section, we’re going to fix it.
9.2 Discussing the format our annotations need to be in
According to the error we got in the previous segment, the transformers.ConditionalDetrImageProcessor.preprocess
method expects input annotations in COCO format.
In the documentation we can read that the annotations
parameter takes in a list of dictionaries with the following keys:
"image_id"
(int
): The image id."annotations"
(List[Dict]
): List of annotations for an image. Each annotation should be a dictionary. An image can have no annotations, in which case the list should be empty.
As for the "annotations"
field, this should be a list of dictionaries containing individual annotations in COCO format:
# COCO format, see: https://cocodataset.org/#format-data
[{"image_id": 42,
  "annotations": [
      {
          "id": 123456,
          "category_id": 1,
          "iscrowd": 0,
          "segmentation": [
              [42.0, 55.6, ... 99.3, 102.3]
          ],
          "image_id": 42, # this matches the 'image_id' field above
          "area": 135381.07,
          "bbox": [523.70,
                   545.09,
                   402.79,
                   336.11]
      },
      # Next annotation in the same format as the previous one (one annotation per dict).
      # For example, if an image had 4 bounding boxes, there would be a list of 4 dictionaries
      # each containing a single annotation.
      ...
  ]
}]
Let’s break down each of the fields in the COCO annotation:

| Field | Requirement | Data Type | Description |
|---|---|---|---|
| image_id (top-level) | Required | Integer | ID of the target image. |
| annotations | Required | List[Dict] | List of dictionaries with one box annotation per dict. Can be empty if there are no boxes. |
| id | Not required | Integer | ID of the particular annotation. |
| category_id | Required | Integer | ID of the class the box relates to (e.g. {0: 'bin', 1: 'hand', 2: 'not_bin', 3: 'not_hand', 4: 'not_trash', 5: 'trash'}). |
| segmentation | Not required | List or None | Segmentation mask related to an annotation instance. Our focus is on boxes, not segmentation. |
| image_id (inside annotations field) | Required | Integer | ID of the target image the particular box relates to, should match the top-level image_id field. |
| area | Not required | Float | Area of the target bounding box (e.g. box height * width). |
| bbox | Required | List[Float] | Coordinates of the target bounding box in XYWH ([x, y, width, height]) format. (x, y) are the top left corner coordinates, width and height are the box dimensions. |
| iscrowd | Not required | Integer | Boolean flag (0 or 1) to indicate whether or not an object is multiple (a crowd) of the same thing. For example, a crowd of “people” or a group of “apples” rather than a single apple. |
And now our annotation data comes in the format:
{'image': <PIL.Image.Image image mode=RGB size=960x1280>,
 'image_id': 292,
 'annotations': {'file_name': ['00347467-13f1-4cb9-94aa-4e4369457e0c.jpeg',
   '00347467-13f1-4cb9-94aa-4e4369457e0c.jpeg'],
  'image_id': [292, 292],
  'category_id': [1, 0],
  'bbox': [[523.7000122070312,
    545.0999755859375,
    402.79998779296875,
    336.1000061035156],
   [10.399999618530273,
    163.6999969482422,
    943.4000244140625,
    1101.9000244140625]],
  'iscrowd': [0, 0],
  'area': [135381.078125, 1039532.4375]},
 'label_source': 'manual_prodigy_label',
 'image_source': 'manual_taken_photo'}
How about we write some code to convert our current annotation format to COCO format?
It’s common practice to get a dataset in a certain format and then have to preprocess it into another format before you can use it with a model.
We’re getting hands-on and practicing here so when it comes to working on converting another dataset, you’ve already had some practice.
9.3 Creating dataclasses to represent the COCO bounding box format
Let’s write some code to transform our existing annotation data into the format required by transformers.ConditionalDetrImageProcessor.preprocess
.
We’ll start by creating two Python dataclasses to house our desired COCO annotation format.
To do this we’ll:
- Create SingleCOCOAnnotation which contains the format structure of a single COCO annotation.
- Create ImageCOCOAnnotations which contains all of the annotations for a given image in COCO format. This may be a single instance of SingleCOCOAnnotation or multiple.
We’ll decorate both of these with the @dataclass
decorator.
Using a @dataclass
gives several benefits:
- Type hints - we can define the types of objects we want in the class definition, for example, we want image_id to be an int.
- Helpful built-in methods - we can use helpers such as dataclasses.asdict to convert an instance of our @dataclass into a dictionary (COCO wants lists of dictionaries).
- Data validation - we can use methods such as __post_init__ to run checks on our @dataclass as it’s initialized, for example, we always want the length of bbox to be 4 (bounding box coordinates in XYWH format).
from dataclasses import dataclass, asdict
from typing import List, Tuple
# 1. Create a dataclass for a single COCO annotation
@dataclass
class SingleCOCOAnnotation:
"""An instance of a single COCO annotation.
Represent a COCO-formatted (see: https://cocodataset.org/#format-data) single instance of an object
in an image.
Attributes:
image_id: Unique integer identifier for the image which the annotation belongs to.
category_id: Integer identifier for the target object label/category (e.g. "0" for "bin").
bbox: List of floats containing target bounding box coordinates in absolute XYWH format ([x_top_left, y_top_left, width, height]).
area: Area of the target bounding box. Defaults to 0.0.
iscrowd: Boolean flag (0 or 1) indicating whether the target is a crowd of objects, for example, a group of
apples rather than a single apple. Defaults to 0.
"""
    image_id: int
    category_id: int
    bbox: List[float] # bboxes in XYWH format ([x_top_left, y_top_left, width, height])
    area: float = 0.0
    iscrowd: int = 0

    # Make sure the bbox is always a list of 4 values (XYWH format)
    def __post_init__(self):
        if len(self.bbox) != 4:
            raise ValueError(f"bbox must contain exactly 4 values, current length: {len(self.bbox)}")
# 2. Create a dataclass for a collection of COCO annotations for a single image
@dataclass
class ImageCOCOAnnotations:
"""A collection of COCO annotations for a single image_id.
Attributes:
image_id: Unique integer identifier for the image which the annotations belong to.
annotations: List of SingleCOCOAnnotation instances.
"""
    image_id: int
    annotations: List[SingleCOCOAnnotation]
Beautiful!
Let’s now inspect our SingleCOCOAnnotation
dataclass.
We can use the SingleCOCOAnnotation?
syntax to view the docstring of the class.
# One of the benefits of using a dataclass is that we can inspect the attributes with the `?` syntax
SingleCOCOAnnotation?
Init signature:
SingleCOCOAnnotation(
image_id: int,
category_id: int,
bbox: List[float],
area: float = 0.0,
iscrowd: int = 0,
) -> None
Docstring:
An instance of a single COCO annotation.
Represent a COCO-formatted (see: https://cocodataset.org/#format-data) single instance of an object
in an image.
Attributes:
image_id: Unique integer identifier for the image which the annotation belongs to.
category_id: Integer identifier for the target object label/category (e.g. "0" for "bin").
bbox: List of floats containing target bounding box coordinates in absolute XYWH format ([x_top_left, y_top_left, width, height]).
area: Area of the target bounding box. Defaults to 0.0.
iscrowd: Boolean flag (0 or 1) indicating whether the target is a crowd of objects, for example, a group of
apples rather than a single apple. Defaults to 0.
Type: type
Subclasses:
We can also see the error handling of our __post_init__
method in action by trying to create an instance of SingleCOCOAnnotation
with an incorrect number of bbox values.
# Let's try our SingleCOCOAnnotation dataclass (this will error since the bbox doesn't have 4 values)
SingleCOCOAnnotation(image_id=42,
                     category_id=0,
                     bbox=[100, 100, 100]) # missing a 4th value
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[35], line 2 1 # Let's try our SingleCOCOAnnotation dataclass (this will error since the bbox doesn't have 4 values) ----> 2 SingleCOCOAnnotation(image_id=42, 3 category_id=0, 4 bbox=[100, 100, 100]) # missing a 4th value File <string>:8, in __init__(self, image_id, category_id, bbox, area, iscrowd) Cell In[33], line 29, in SingleCOCOAnnotation.__post_init__(self) 27 def __post_init__(self): 28 if len(self.bbox) != 4: ---> 29 raise ValueError(f"bbox must contain exactly 4 values, current length: {len(self.bbox)}") ValueError: bbox must contain exactly 4 values, current length: 3
And now if we pass the correct number of values to our SingleCOCOAnnotation
, it should work.
SingleCOCOAnnotation(image_id=42,
                     category_id=0,
                     bbox=[100, 100, 100, 100]) # correct number of values
SingleCOCOAnnotation(image_id=42, category_id=0, bbox=[100, 100, 100, 100], area=0.0, iscrowd=0)
9.4 Creating a function to format our annotations as COCO format
Now we’ve got the COCO data format in our SingleCOCOAnnotation
and ImageCOCOAnnotations
dataclasses, let’s write a function to take our existing image annotations and format them in COCO style.
Our format_image_annotations_as_coco
function will:
- Take in an
image_id
to represent a unique identifier for the image as well as lists of category integers, area values and bounding box coordinates. - Perform a list comprehension on a zipped version of each category, area and bounding box coordinate value in the input lists creating an instance of
SingleCOCOAnnotation
as a dictionary (using theasdict
method) each time, this will give us a list ofSingleCOCOAnnotation
formatted dictionaries. - Return a dictionary version of
ImageCOCOAnnotations
usingasdict
passing it theimage_id
as well as list ofSingleCOCOAnnotation
dictionaries from 2.
Why does our function take in lists of categories, areas and bounding boxes?
Because that’s the current format our existing annotations are in (how we downloaded them from Hugging Face in the beginning).
Let’s do it!
# 1. Take in a unique image_id as well as lists of categories, areas, and bounding boxes
def format_image_annotations_as_coco(
    image_id: int,
    categories: List[int],
    areas: List[float],
    bboxes: List[Tuple[float, float, float, float]] # bboxes in XYWH format ([x_top_left, y_top_left, width, height])
) -> dict:
    """Formats lists of image annotations into COCO format.

    Takes in parallel lists of categories, areas, and bounding boxes and
    then formats them into a COCO-style dictionary of annotations.

    Args:
        image_id: Unique integer identifier for an image.
        categories: List of integer category IDs for each annotation.
        areas: List of float areas for each annotation.
        bboxes: List of tuples containing bounding box coordinates in XYWH format
            ([x_top_left, y_top_left, width, height]).

    Returns:
        A dictionary of image annotations in COCO format with the following structure:
        {
            "image_id": int,
            "annotations": [
                {
                    "image_id": int,
                    "category_id": int,
                    "bbox": List[float],
                    "area": float
                },
                ...more annotations here
            ]
        }

    Note:
        All input lists must be the same length and in the same order.
        Otherwise, there will be mismatched annotations.
    """
    # 2. Turn input lists into a list of dicts in SingleCOCOAnnotation format
    coco_format_annotations = [
        asdict(SingleCOCOAnnotation(
            image_id=image_id,
            category_id=category,
            bbox=list(bbox),
            area=area
        ))
        for category, area, bbox in zip(categories, areas, bboxes)
    ]

    # 3. Return a dict of annotations with format {"image_id": ..., "annotations": [...]} (required COCO format)
    return asdict(ImageCOCOAnnotations(image_id=image_id,
                                       annotations=coco_format_annotations))
Nice!
Having those pre-built dataclasses makes everything else fall into place.
Now let’s try our format_image_annotations_as_coco
function on our random_sample
from before.
First, we’ll remind ourselves what our random_sample
looks like.
# Inspect our random sample (in original format)
random_sample
{'image': <PIL.Image.Image image mode=RGB size=960x1280>,
'image_id': 267,
'annotations': {'file_name': ['ffb7d590-6667-4b28-8770-f2039267b251.jpeg',
'ffb7d590-6667-4b28-8770-f2039267b251.jpeg',
'ffb7d590-6667-4b28-8770-f2039267b251.jpeg',
'ffb7d590-6667-4b28-8770-f2039267b251.jpeg'],
'image_id': [267, 267, 267, 267],
'category_id': [5, 1, 0, 0],
'bbox': [[346.6000061035156,
510.3999938964844,
364.8999938964844,
411.6000061035156],
[590.4000244140625, 693.5, 264.8999938964844, 310.29998779296875],
[205.60000610351562,
530.0999755859375,
378.70001220703125,
484.20001220703125],
[709.5999755859375, 604.0, 241.5, 642.2000122070312]],
'iscrowd': [0, 0, 0, 0],
'area': [150192.84375, 82198.46875, 183366.546875, 155091.296875]},
'label_source': 'manual_prodigy_label',
'image_source': 'manual_taken_photo'}
Ok wonderful, looks like we can extract the image_id
, category_id, bbox and area
fields from our random_sample
to get the required inputs to our format_image_annotations_as_coco
function.
Let’s try it out.
# Extract image_id, categories, areas, and bboxes from the random sample
random_sample_image_id = random_sample["image_id"]
random_sample_categories = random_sample["annotations"]["category_id"]
random_sample_areas = random_sample["annotations"]["area"]
random_sample_bboxes = random_sample["annotations"]["bbox"]

# Format the random sample annotations as COCO format
random_sample_coco_annotations = format_image_annotations_as_coco(image_id=random_sample_image_id,
                                                                  categories=random_sample_categories,
                                                                  areas=random_sample_areas,
                                                                  bboxes=random_sample_bboxes)
random_sample_coco_annotations
{'image_id': 267,
'annotations': [{'image_id': 267,
'category_id': 5,
'bbox': [346.6000061035156,
510.3999938964844,
364.8999938964844,
411.6000061035156],
'area': 150192.84375,
'iscrowd': 0},
{'image_id': 267,
'category_id': 1,
'bbox': [590.4000244140625, 693.5, 264.8999938964844, 310.29998779296875],
'area': 82198.46875,
'iscrowd': 0},
{'image_id': 267,
'category_id': 0,
'bbox': [205.60000610351562,
530.0999755859375,
378.70001220703125,
484.20001220703125],
'area': 183366.546875,
'iscrowd': 0},
{'image_id': 267,
'category_id': 0,
'bbox': [709.5999755859375, 604.0, 241.5, 642.2000122070312],
'area': 155091.296875,
'iscrowd': 0}]}
Woohoo!
Looks like we may have just fixed our ValueError
from before:
ValueError: Invalid COCO detection annotations. Annotations must a dict (single image) or list of dicts (batch of images) with the following keys:
image_id
andannotations
, with the latter being a list of annotations in the COCO format.
Our COCO formatted annotations have the image_id
and annotations
keys and our annotations
are a list of annotations in COCO format.
Perfect!
9.5 Preprocess a single image and set of COCO format annotations
Now we’ve preprocessed our annotations to be in COCO format, we can use them with transformers.ConditionalDetrImageProcessor.preprocess
.
Let’s pass our random_sample
image and COCO formatted annotations to the preprocess
method.
The default value for the parameter do_convert_annotations
of the preprocess
method is True
.
This means our boxes will go into the preprocess
method in absolute XYWH
format (the format we downloaded them in) and will be returned in normalized CXCYWH
(or (center_x, center_y, width, height)
) format.
Whenever you perform adjustments or preprocessing steps on your annotations, it’s always good to keep track of the format that they are in, otherwise it can lead to unexpected bugs later on.
# Preprocess random sample image and associated annotations
random_sample_preprocessed = image_processor.preprocess(images=random_sample["image"],
                                                        annotations=random_sample_coco_annotations,
                                                        do_convert_annotations=True, # defaults to True, this will convert our annotations to normalized CXCYWH format
                                                        return_tensors="pt") # can return as tensors or not, "pt" returns as PyTorch tensors
The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.
When processing our single image and annotation, you may see a warning similar to the following:
The
max_size
parameter is deprecated and will be removed in v4.26. Please specify insize['longest_edge'] instead
.
If you are not using the max_size
parameter and are using a version of transformers
> 4.26, you can ignore this or disable it (as shown below).
# Optional: Disable warnings about `max_size` parameter being deprecated
import warnings
"ignore", message="The `max_size` parameter is deprecated*") warnings.filterwarnings(
Excellent!
It looks like the preprocess
method worked on our single sample.
Let’s inspect the keys()
method of our random_sample_preprocessed
.
# Check the keys of our preprocessed example
random_sample_preprocessed.keys()
dict_keys(['pixel_values', 'pixel_mask', 'labels'])
Wonderful, we get a preprocessed image and labels:
- pixel_values = the preprocessed image pixels.
- pixel_mask = whether or not to mask each pixel (e.g. 0 = mask, 1 = no mask; in our case, all values will be 1 since we want the model to see all pixels).
- labels = the preprocessed labels (the preprocessed annotations).
# Inspect preprocessed image shape
print(f"[INFO] Preprocessed image shape: {random_sample_preprocessed['pixel_values'].shape} -> [batch_size, colour_channels, height, width]")
[INFO] Preprocessed image shape: torch.Size([1, 3, 640, 480]) -> [batch_size, colour_channels, height, width]
Since we only passed a single sample to preprocess
, we get back a batch size of 1.
Now how do our labels look?
# Inspect the preprocessed labels (our boxes and other metadata)
"labels"]) pprint(random_sample_preprocessed[
[{'area': tensor([37548.2109, 20549.6172, 45841.6367, 38772.8242]),
'boxes': tensor([[0.5511, 0.5595, 0.3801, 0.3216],
[0.7530, 0.6630, 0.2759, 0.2424],
[0.4114, 0.6033, 0.3945, 0.3783],
[0.8649, 0.7227, 0.2516, 0.5017]]),
'class_labels': tensor([5, 1, 0, 0]),
'image_id': tensor([267]),
'iscrowd': tensor([0, 0, 0, 0]),
'orig_size': tensor([1280, 960]),
'size': tensor([640, 480])}]
Let’s break this down:
- area - An array/tensor of floats containing the area (box_width * box_height) of each box.
- boxes - An array/tensor containing all of the bounding boxes for our image in normalized CXCYWH ((center_x, center_y, width, height)) format (we can sanity check this conversion ourselves, see the short example after this list).
- class_labels - An array/tensor of integer labels associated with each box (e.g. tensor([5, 1, 0, 0, 4]) -> ['trash', 'hand', 'bin', 'bin', 'not_trash']).
- image_id - A unique integer identifier for our target image.
- iscrowd - An array/tensor of boolean values (0 or 1) indicating whether an annotation is a group of objects or not.
- orig_size - An array/tensor containing the original image size in (height, width) format (important for computing conversion factors when drawing boxes on the originally sized image).
- size - An array/tensor containing the current size in (height, width) format of the processed image tensor stored in random_sample_preprocessed["pixel_values"].
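To make the box conversion concrete, we can sanity check it ourselves. The following is a minimal sketch of the XYWH (absolute) to CXCYWH (normalized) arithmetic (not the library's internal code), using the first box of our random_sample and its original image size of 960x1280 from earlier.

# A minimal sketch of the XYWH (absolute) -> CXCYWH (normalized) arithmetic (not the library's internal code)
def xywh_abs_to_cxcywh_norm(bbox, image_width, image_height):
    x_top_left, y_top_left, box_width, box_height = bbox
    center_x = (x_top_left + box_width / 2) / image_width
    center_y = (y_top_left + box_height / 2) / image_height
    return [round(center_x, 4), round(center_y, 4), round(box_width / image_width, 4), round(box_height / image_height, 4)]

# The result should line up with the first row of 'boxes' in the preprocessed labels above
xywh_abs_to_cxcywh_norm(bbox=random_sample["annotations"]["bbox"][0],
                        image_width=960,
                        image_height=1280)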
Woohoo!
We’ve done it!
We’ve officially preprocessed a single sample of our own data, both the image and its annotation pair.
We’ll write some code later on to scale this up to our whole dataset.
For now, let’s see what it looks like postprocessing a single output.
10 TK - Postprocessing a single output
We’ve got our inputs processed and successfully passed them through our model.
How about we postprocess the outputs of our model?
Doing so will make our model’s outputs far more usable.
11 Going end-to-end on a single sample
When working on a new problem or with a custom dataset and an existing model, it’s good practice to go end-to-end on a single sample.
For example, preprocess one of your samples, pass it through the model and then postprocess it (just like we’re in the middle of doing here).
Being able to go end-to-end on a single sample will help you see the overall process and discover any bugs that may hinder you later on.
To postprocess the outputs of our model we can use the transformers.ConditionalDetrImageProcessor.post_process_object_detection()
method.
Let’s first recompute the model’s outputs for our preprocessed single sample.
# Recompute the random sample outputs with our preprocessed sample
random_sample_outputs = model(
    pixel_values=random_sample_preprocessed["pixel_values"], # model expects input [batch_size, color_channels, height, width]
    pixel_mask=random_sample_preprocessed["pixel_mask"]
)
# Inspect the output type
type(random_sample_outputs)
transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrObjectDetectionOutput
Wonderful!
We get the exact output our post_process_object_detection()
method is looking for.
Now we can fill in the following parameters:
- outputs - Raw outputs of the model (for us, this is random_sample_outputs).
- threshold - A float score value used to keep or discard boxes (e.g. threshold=0.3 means all boxes with a score under 0.3 will be discarded). A higher value means only the boxes the model is most confident in will be kept; a lower value means more boxes will be kept, though these may be of lower quality. Best experimented with for your use case.
- target_sizes - Size of the target image in (height, width) format for the bounding boxes. For example, if our image is 960 pixels wide by 1280 high, we could pass in [1280, 960]. The number of target_sizes must match the number of outputs: if we pass in 1 set of outputs, only 1 target_sizes item is needed; if we pass in a batch of 32 outputs, 32 target_sizes are required, otherwise it will error. If None, postprocessed boxes won't be resized (this can lead to poor looking boxes since the coordinates won't match your image).
- top_k - Integer defining the number of boxes to keep as candidates before thresholding. Defaults to 100. For example, top_k=100 and threshold=0.3 means take the 100 highest scoring boxes and then, of those 100, only keep the ones with a score over 0.3.
You can see what happens behind the scenes of post_process_object_detection
in the source code.
# Post process a single output from our model
random_sample_outputs_post_processed = image_processor.post_process_object_detection(
    outputs=random_sample_outputs,
    threshold=0.62, # all boxes with scores under this value will be discarded (best to experiment with it)
    target_sizes=[random_sample_preprocessed["labels"][0]["orig_size"]] # original input image size (or whichever target size you'd like), required to be the same number of items as outputs, in a list
)
random_sample_outputs_post_processed
[{'scores': tensor([0.6514, 0.6501, 0.6496, 0.6377, 0.6369, 0.6342, 0.6337, 0.6334, 0.6331,
0.6330, 0.6326, 0.6322, 0.6317, 0.6300, 0.6275, 0.6266, 0.6262, 0.6243,
0.6240, 0.6233, 0.6232, 0.6230, 0.6214, 0.6209, 0.6205, 0.6205, 0.6204,
0.6201], grad_fn=<IndexBackward0>),
'labels': tensor([6, 6, 6, 6, 6, 6, 6, 3, 6, 6, 0, 6, 6, 4, 6, 6, 6, 6, 3, 6, 6, 6, 6, 6,
6, 6, 5, 6]),
'boxes': tensor([[ 9.1235e+02, 6.3140e+02, 9.5952e+02, 6.8144e+02],
[ 1.5850e+02, 2.8896e+02, 2.1996e+02, 3.1913e+02],
[ 4.4281e+02, 3.0125e+02, 5.0944e+02, 3.3654e+02],
[ 3.4421e+02, 3.4005e+02, 3.8862e+02, 3.7213e+02],
[ 4.6401e+02, 5.1168e+02, 5.6634e+02, 5.5719e+02],
[ 3.6855e+02, 3.3014e+02, 4.2090e+02, 3.9279e+02],
[ 6.6788e+02, 8.9587e+02, 7.0936e+02, 9.4438e+02],
[ 6.7845e+02, 8.5797e+02, 7.6454e+02, 1.0156e+03],
[ 1.3206e+02, 2.7962e+02, 1.8723e+02, 3.2723e+02],
[ 3.7477e+02, 3.2468e+02, 4.2061e+02, 3.5514e+02],
[ 7.0972e+02, 9.4040e+02, 7.6190e+02, 1.0178e+03],
[ 3.6536e+02, 3.3430e+02, 5.1969e+02, 4.2882e+02],
[ 7.7893e+02, 6.0122e+02, 9.5037e+02, 7.1507e+02],
[ 2.0268e+00, 7.2448e+02, 9.6089e+02, 1.2638e+03],
[-1.7890e+00, 4.9846e+02, 7.1313e+01, 6.0997e+02],
[ 4.0764e-02, 4.5550e+02, 4.4586e+00, 5.5671e+02],
[ 6.3387e+02, 5.9976e+02, 9.4692e+02, 9.1215e+02],
[ 3.0457e+02, 5.4763e+02, 4.4348e+02, 6.3650e+02],
[ 2.8229e+02, 3.2606e+02, 7.2076e+02, 5.2406e+02],
[-5.7805e-01, -4.6888e-01, 1.0614e+01, 5.5209e+01],
[ 4.8215e+02, 2.6792e+02, 5.2980e+02, 2.9737e+02],
[ 6.6114e+02, 3.0081e+02, 7.4817e+02, 3.4843e+02],
[ 1.3743e+00, 4.9630e+02, 1.0422e+02, 5.9861e+02],
[ 6.2789e+02, 3.1401e+02, 6.9397e+02, 3.5158e+02],
[ 6.2282e+02, 3.3377e+02, 9.4820e+02, 5.0515e+02],
[ 7.1680e+02, 3.3967e+02, 9.5090e+02, 5.0642e+02],
[ 6.8087e+02, 9.2954e+02, 7.4110e+02, 1.0025e+03],
[ 6.8087e+02, 9.2954e+02, 7.4110e+02, 1.0025e+03]],
grad_fn=<IndexBackward0>)}]
UPTOHERE - break down the outputs of our postprocessed sample, then explain each of them before plotting a set of poor boxes on the image
TK - break the following down:
- scores = prediction probabilities, found by applying torch.sigmoid() to the raw output logits.
- labels = predicted class IDs, which we can turn into class names using id2label (see the quick check after this list).
- boxes = box coordinates in XYXY format (x_top_left, y_top_left, x_bottom_right, y_bottom_right), scaled to target_sizes (so absolute pixel values here).
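As a quick check of the labels point above, we can map the predicted label IDs back to class names with id2label and count them (a small sketch, and since the model is untrained here the counts are mostly noise).

# Map predicted label IDs to class names and count how often each appears
from collections import Counter

random_sample_pred_label_names = [id2label[label.item()] for label in random_sample_outputs_post_processed[0]["labels"]]
Counter(random_sample_pred_label_names)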
# TK - to find the scores: Output logits will be post-processed to turn into prediction probabilities as well as boxes
# (see source code for more)
# TK - behind the scenes the post_process_object_detection function will perform a sigmoid operation on the logits to get the prediction probabilities
# Get pred probs from logits, this will be used for our threshold parameter in post_process_object_detection
torch.sigmoid(random_sample_outputs.logits)
tensor([[[0.5651, 0.4310, 0.5210, ..., 0.5277, 0.5380, 0.4887],
[0.4769, 0.4815, 0.5547, ..., 0.5068, 0.5578, 0.5497],
[0.5657, 0.4081, 0.5427, ..., 0.4618, 0.5685, 0.5589],
...,
[0.5532, 0.4581, 0.5263, ..., 0.5532, 0.5883, 0.5798],
[0.5777, 0.4055, 0.5471, ..., 0.4489, 0.5428, 0.6063],
[0.5283, 0.4102, 0.5490, ..., 0.5313, 0.5663, 0.5830]]],
grad_fn=<SigmoidBackward0>)
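To build intuition for how threshold and top_k interact with these prediction probabilities, here's a simplified sketch (an approximation for intuition only, see the linked source code for the exact implementation).

# Simplified sketch: flatten the per-query, per-class probabilities, take the top_k candidates,
# then keep only those above the threshold (approximates what post_process_object_detection does)
random_sample_pred_probs = torch.sigmoid(random_sample_outputs.logits)[0]   # [num_queries, num_classes]
top_scores, top_flat_idxs = random_sample_pred_probs.flatten().topk(k=100)  # top_k candidate (query, class) pairs
top_class_ids = top_flat_idxs % random_sample_pred_probs.shape[-1]          # recover the class index of each candidate
kept = top_scores > 0.62                                                    # same threshold we used above

print(f"[INFO] Candidate boxes kept after thresholding: {kept.sum().item()}")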
TK - How the size comes about
The image processor resizes our image during preprocessing, so the preprocessed size differs from the original size. If we want our boxes to line up with the original image, they need to be rescaled using the original size (this is what target_sizes drives).
# TK - note on preprocessed size vs original size (if we want the boxes to look good on the original image, they should be adjusted)
print(f"[INFO] Image original size: {random_sample_preprocessed.labels[0].orig_size} (height, width)")
print(f"[INFO] Image size after preprocessing: {random_sample_preprocessed.labels[0].size} (height, width)")
[INFO] Image original size: tensor([1280, 960]) (height, width)
[INFO] Image size after preprocessing: tensor([640, 480]) (height, width)
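For example, we can manually rescale the first ground truth box from normalized CXCYWH back to absolute XYXY on the original image using orig_size (a small sketch of the same arithmetic target_sizes performs for us).

# Rescale the first ground truth box from normalized CXCYWH to absolute XYXY using the original image size
orig_height, orig_width = random_sample_preprocessed["labels"][0]["orig_size"].tolist()
center_x, center_y, box_width, box_height = random_sample_preprocessed["labels"][0]["boxes"][0].tolist()

x_min = (center_x - box_width / 2) * orig_width
y_min = (center_y - box_height / 2) * orig_height
x_max = (center_x + box_width / 2) * orig_width
y_max = (center_y + box_height / 2) * orig_height

print(f"[INFO] First ground truth box in absolute XYXY on the original image: {[round(value, 1) for value in [x_min, y_min, x_max, y_max]]}")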
11.1 TK - Plotting our model’s first box predictions on an image
TK - let’s visualize, visualize, visualize!
Our postprocessed boxes are in XYXY format, the format required by torchvision’s draw_bounding_boxes - https://pytorch.org/vision/main/generated/torchvision.utils.draw_bounding_boxes.html
# Extract scores, labels and boxes
random_sample_pred_scores = random_sample_outputs_post_processed[0]["scores"]
random_sample_pred_labels = random_sample_outputs_post_processed[0]["labels"]
random_sample_pred_boxes = half_boxes(random_sample_outputs_post_processed[0]["boxes"])

# Create a list of labels and colours to plot on the image/boxes
random_sample_pred_labels_to_plot = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                     for label_pred, score_pred in zip(random_sample_pred_labels, random_sample_pred_scores)]
random_sample_pred_colours = [colour_palette[id2label[label_pred.item()]] for label_pred in random_sample_pred_labels]

print(f"[INFO] Labels with scores: {random_sample_pred_labels_to_plot[:3]}...")

# Plot the random sample image with randomly predicted boxes
# (these will be very poor since the model is not trained on our data yet)
to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=half_image(random_sample["image"])),
        boxes=random_sample_pred_boxes, # boxes are in XYXY format, which is required for draw_bounding_boxes
        labels=random_sample_pred_labels_to_plot,
        colors=random_sample_pred_colours,
        width=3
    )
)
[INFO] Labels with scores: ['Pred: trash_arm (0.6514)', 'Pred: trash_arm (0.6501)', 'Pred: trash_arm (0.6496)']...
Our predictions are poor since our model hasn’t been specifically trained on our data.
But we can improve them by fine-tuning the model to our dataset.
12 TK - Bounding box formats in and out of our model
TK image - turn this table into a nice image
Box formats at different stages:

Stage | Box format | Layout | Reference |
---|---|---|---|
Starting data (the input data) | XYWH (absolute) | [x_top_left, y_top_left, width, height] | The format our annotations were downloaded in |
Out of image_processor.preprocess() (goes into the model) | CXCYWH (normalized) | [center_x, center_y, width, height] | https://huggingface.co/docs/transformers.js/en/custom_usage |
Out of model (pred_boxes) | CXCYWH (normalized) | [center_x, center_y, width, height] | See docs for forward() and the pred_boxes output: https://huggingface.co/docs/transformers/main/en/model_doc/conditional_detr#transformers.ConditionalDetrForObjectDetection.forward |
Out of image_processor.post_process_object_detection() | XYXY (PASCAL VOC format, (xmin, ymin, xmax, ymax)) | [x_top_left, y_top_left, x_bottom_right, y_bottom_right] | https://huggingface.co/docs/transformers/main/en/model_doc/conditional_detr#transformers.ConditionalDetrImageProcessor.post_process_object_detection |

If you ever need to convert between these formats yourself, torchvision has a helper (see the short example below).
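Here's a short sketch using torchvision.ops.box_convert with an illustrative example box (the values are made up, all in absolute pixel coordinates).

import torch
from torchvision.ops import box_convert

# An example box in absolute XYWH format: [x_top_left, y_top_left, width, height]
example_box_xywh = torch.tensor([[346.6, 510.4, 364.9, 411.6]])

# XYWH -> XYXY (the format draw_bounding_boxes expects)
print(f"XYXY: {box_convert(boxes=example_box_xywh, in_fmt='xywh', out_fmt='xyxy')}")

# XYWH -> CXCYWH (still absolute, divide by image width/height if you need normalized values)
print(f"CXCYWH (absolute): {box_convert(boxes=example_box_xywh, in_fmt='xywh', out_fmt='cxcywh')}")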
13 TK - Fine-tune the model to our dataset
Steps:
- Preprocess the dataset (no augmentation) and get it ready for a model to train on.
- Train the model.
- Inspect the results of the trained model.
13.1 TK - Preprocess dataset for model
- We’ve preprocessed and tried one sample, now we can do the same for batches of data.
UPTOHERE
Next:
- TK - write a function to transform batches of images (no augmentation, can add augmentation later), e.g. call it “preprocess_batch_of_examples”
- TK - preprocess the datasets using .with_transform (only need one function to batchify data, can add transforms later)
- TK - create a collate function
def preprocess_batch(examples,
                     # transforms, # Note: Could optionally add transforms (e.g. data augmentation) here
                     image_processor):
    """
    Function to preprocess batches of data.

    Can optionally apply a transform later on.
    """
    images = []
    coco_annotations = []

    for image, image_id, annotations_dict in zip(examples["image"], examples["image_id"], examples["annotations"]):

        # Note: may need to open the image here if it is an image path rather than a PIL.Image
        bbox_list = annotations_dict["bbox"]
        category_list = annotations_dict["category_id"]
        area_list = annotations_dict["area"]

        # Note: Could optionally apply a transform here.

        # Format the annotations into COCO format
        coco_format_annotations = format_image_annotations_as_coco(image_id=image_id,
                                                                   categories=category_list,
                                                                   areas=area_list,
                                                                   bboxes=bbox_list)

        # Add images/annotations to their respective lists
        images.append(image)
        coco_annotations.append(coco_format_annotations)

    # Apply the image processor to the lists of images and annotations
    preprocessed_batch = image_processor.preprocess(images=images,
                                                    annotations=coco_annotations,
                                                    return_tensors="pt")

    return preprocessed_batch
# Create a partial function for preprocessing
from functools import partial

# Note: Could create separate preprocessing functions per split later on (e.g. one with augmentation for training only)
preprocess_batch_partial = partial(preprocess_batch,
                                   image_processor=image_processor)
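Since functools.partial pre-fills the image_processor argument, calling preprocess_batch_partial(examples) is equivalent to calling preprocess_batch(examples, image_processor=image_processor). As a quick sanity check, we can run it on a small slice of the raw training data (a sketch, assuming the un-split dataset is still available at this point).

# Try the partial function on a couple of raw training examples
example_slice = dataset["train"][0:2]
example_slice_preprocessed = preprocess_batch_partial(example_slice)
print(f"[INFO] Keys returned: {list(example_slice_preprocessed.keys())}")
print(f"[INFO] Preprocessed pixel_values shape: {example_slice_preprocessed['pixel_values'].shape}")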
13.2 TK - Split the data
# Split the data
dataset_split = dataset["train"].train_test_split(test_size=0.3, seed=42) # split the dataset into 70/30 train/test
dataset_test_val_split = dataset_split["test"].train_test_split(test_size=0.6, seed=42) # split the test set into 40/60 validation/test

# Create splits
dataset["train"] = dataset_split["train"]
dataset["validation"] = dataset_test_val_split["train"]
dataset["test"] = dataset_test_val_split["test"]

dataset
TK - apply processing function to each split
# Apply the preprocessing function to the datasets (the preprocessing will happen on the fly, e.g. when the dataset is called rather than in-place)
processed_dataset = dataset.copy()
processed_dataset["train"] = dataset["train"].with_transform(transform=preprocess_batch_partial)
processed_dataset["validation"] = dataset["validation"].with_transform(transform=preprocess_batch_partial)
processed_dataset["test"] = dataset["test"].with_transform(transform=preprocess_batch_partial)
"validation"][0] processed_dataset[
{'pixel_values': tensor([[[ 0.1254, 0.1254, 0.1597, ..., -2.0837, -1.9809, -1.9295],
[ 0.1426, 0.1254, 0.1597, ..., -2.0494, -1.9638, -1.9467],
[ 0.1426, 0.1426, 0.1597, ..., -1.9467, -1.9295, -1.9467],
...,
[ 1.2899, 1.0502, 1.1358, ..., 0.7248, 0.7933, 0.7762],
[ 1.4098, 1.1872, 1.0331, ..., 0.7077, 0.7419, 0.7419],
[ 1.2728, 0.9646, 0.9303, ..., 0.7077, 0.7591, 0.7248]],
[[ 1.2206, 1.1856, 1.1506, ..., -1.9832, -1.8782, -1.7731],
[ 1.2381, 1.1856, 1.1506, ..., -1.9657, -1.8606, -1.8256],
[ 1.2381, 1.2031, 1.1681, ..., -1.8606, -1.8256, -1.8431],
...,
[ 1.2906, 1.0630, 1.1506, ..., 0.3803, 0.4503, 0.4328],
[ 1.4307, 1.2031, 1.0280, ..., 0.3627, 0.3978, 0.3978],
[ 1.2906, 0.9755, 0.9230, ..., 0.3627, 0.4153, 0.3803]],
[[ 2.1346, 2.2217, 2.1868, ..., -1.7173, -1.6127, -1.5604],
[ 2.1520, 2.2217, 2.1868, ..., -1.6999, -1.5953, -1.5779],
[ 2.1694, 2.2217, 2.1868, ..., -1.5953, -1.5430, -1.5604],
...,
[ 1.2108, 0.9842, 1.0539, ..., 0.3568, 0.4265, 0.4091],
[ 1.3154, 1.0888, 0.9494, ..., 0.3393, 0.3742, 0.3742],
[ 1.1759, 0.8622, 0.8448, ..., 0.3393, 0.3916, 0.3568]]]),
'pixel_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]]),
'labels': {'size': tensor([640, 480]), 'image_id': tensor([719]), 'class_labels': tensor([4, 4, 1, 5, 0, 0]), 'boxes': tensor([[0.1898, 0.1767, 0.2161, 0.1620],
[0.5669, 0.1938, 0.0742, 0.0805],
[0.7672, 0.7768, 0.4526, 0.4327],
[0.4715, 0.6213, 0.2235, 0.1502],
[0.3973, 0.5639, 0.7729, 0.6337],
[0.6906, 0.4581, 0.5110, 0.4600]]), 'area': tensor([ 10753.6875, 1833.4000, 60167.3867, 10316.8945, 150459.0469,
72216.3203]), 'iscrowd': tensor([0, 0, 0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}}
# Now when we call one or more of our samples, the preprocessing will take place
"train"][0:10] processed_dataset[
{'pixel_values': tensor([[[[-1.5870, -1.5870, -1.6042, ..., -1.2617, -1.2617, -1.2788],
[-1.5870, -1.5870, -1.5870, ..., -0.9363, -0.9192, -0.9192],
[-1.6042, -1.5870, -1.5870, ..., -0.8164, -0.8335, -0.8164],
...,
[-1.2959, -1.4329, -0.5938, ..., -0.5596, -0.2856, -0.4054],
[-1.2103, -0.9192, -0.3541, ..., -0.5596, 0.1426, 0.1768],
[-0.5938, -0.6109, -0.7137, ..., -0.4226, 0.4337, 0.6906]],
[[-1.9482, -1.9482, -1.9657, ..., -1.0903, -1.0903, -1.1078],
[-1.9482, -1.9482, -1.9482, ..., -0.7227, -0.6877, -0.7052],
[-1.9657, -1.9482, -1.9482, ..., -0.5476, -0.5826, -0.5651],
...,
[-0.9503, -1.0728, -0.1975, ..., -0.1625, 0.0826, -0.0924],
[-0.8803, -0.5476, 0.0476, ..., -0.1625, 0.5028, 0.4678],
[-0.2150, -0.1975, -0.2850, ..., -0.0399, 0.7654, 0.9755]],
[[-1.7347, -1.7347, -1.7522, ..., -0.8807, -0.8807, -0.8981],
[-1.7347, -1.7347, -1.7347, ..., -0.5321, -0.4973, -0.5147],
[-1.7522, -1.7347, -1.7347, ..., -0.3753, -0.3927, -0.3753],
...,
[-1.4210, -1.4907, -0.8110, ..., -1.0550, -0.7761, -0.9330],
[-1.3861, -1.2467, -0.8110, ..., -1.0898, -0.3927, -0.4275],
[-1.0376, -1.1247, -1.3687, ..., -0.9853, -0.1835, 0.0256]]],
[[[-1.7412, -1.8268, -1.7754, ..., -1.5870, -1.2788, -1.4329],
[-1.6555, -1.6213, -1.7583, ..., -1.3815, -1.4158, -1.7240],
[-1.7583, -1.7583, -1.3987, ..., -1.6042, -1.8782, -1.9124],
...,
[ 0.2624, 1.4440, 1.3584, ..., 0.3823, 0.8276, 1.0502],
[ 0.4851, 1.4783, 0.3823, ..., 1.2557, 0.9988, 0.7419],
[-0.0801, -0.0116, -0.1828, ..., 0.9988, 0.8276, 0.8447]],
[[-1.5280, -1.6155, -1.5455, ..., -1.4930, -1.1604, -1.3179],
[-1.4755, -1.3704, -1.5105, ..., -1.2654, -1.3004, -1.5980],
[-1.5980, -1.5455, -1.1078, ..., -1.4755, -1.7731, -1.8081],
...,
[ 0.3978, 1.6057, 1.5182, ..., 0.4853, 0.9230, 1.1155],
[ 0.6254, 1.6408, 0.5203, ..., 1.3782, 1.1155, 0.8354],
[ 0.0476, 0.1176, -0.0749, ..., 1.1331, 0.9405, 0.9230]],
[[-1.7173, -1.6824, -1.6127, ..., -1.3164, -1.0550, -1.2293],
[-1.5430, -1.5779, -1.6650, ..., -1.1073, -1.2119, -1.5256],
[-1.5953, -1.6476, -1.4733, ..., -1.3513, -1.6650, -1.7347],
...,
[ 0.4439, 1.6640, 1.5942, ..., 0.4265, 0.8099, 0.9842],
[ 0.6531, 1.6814, 0.6008, ..., 1.2631, 0.9668, 0.6356],
[ 0.0605, 0.1651, 0.0256, ..., 0.9842, 0.7576, 0.7054]]],
[[[-0.9363, -0.7479, -1.0390, ..., -2.1008, -2.1008, -2.0665],
[-1.3302, -0.9363, -0.7822, ..., -2.1008, -2.1008, -2.0665],
[-1.5014, -1.2617, -0.9705, ..., -2.1008, -2.1008, -2.1008],
...,
[ 1.8550, 1.8379, 1.7523, ..., 1.2899, 1.2899, 0.8789],
[ 1.8208, 1.7523, 1.6838, ..., 1.1015, 1.3927, 0.9474],
[ 1.7009, 1.6153, 1.6324, ..., 1.1187, 1.4783, 1.1187]],
[[-0.7577, -0.5651, -0.8627, ..., -1.9832, -1.9832, -1.9482],
[-1.1604, -0.7577, -0.6001, ..., -1.9832, -1.9657, -1.9307],
[-1.3354, -1.0903, -0.7927, ..., -1.9832, -1.9832, -1.9657],
...,
[ 1.1681, 1.1506, 1.0630, ..., 1.5007, 1.4482, 0.9755],
[ 1.1331, 1.0455, 0.9930, ..., 1.2906, 1.5357, 1.0455],
[ 1.0280, 0.9580, 0.9755, ..., 1.3256, 1.6408, 1.2206]],
[[-1.0376, -0.8110, -1.0724, ..., -1.5779, -1.5779, -1.5604],
[-1.4036, -0.9678, -0.7761, ..., -1.5779, -1.5604, -1.5604],
[-1.5779, -1.2816, -0.9678, ..., -1.5779, -1.5604, -1.5953],
...,
[ 0.8622, 0.8448, 0.7576, ..., 1.5768, 1.4897, 0.9842],
[ 0.8274, 0.7576, 0.7228, ..., 1.3851, 1.5768, 1.0539],
[ 0.7402, 0.6705, 0.7054, ..., 1.4025, 1.6814, 1.2282]]],
...,
[[[-1.2103, -1.1760, -1.1075, ..., -0.7822, -0.9877, -1.0904],
[-0.9192, -0.9705, -1.0219, ..., -0.7993, -1.1247, -1.0219],
[-0.5424, -0.8678, -1.0733, ..., -1.0219, -1.2103, -0.9192],
...,
[ 1.2385, 0.7591, 0.2624, ..., 1.2214, 0.9132, 0.8618],
[ 1.2385, 0.9474, 1.0502, ..., 0.9646, -0.0801, 0.1083],
[ 1.1187, 1.1872, 0.9474, ..., 0.6906, 0.2967, 0.3652]],
[[-1.0728, -1.0378, -0.9678, ..., -0.6001, -0.8102, -0.9153],
[-0.7752, -0.8277, -0.8803, ..., -0.6352, -0.9503, -0.8452],
[-0.3901, -0.7227, -0.9328, ..., -0.8627, -1.0553, -0.7752],
...,
[ 1.0980, 0.6429, 0.1527, ..., 1.3081, 0.9755, 0.9405],
[ 1.0980, 0.8704, 0.9755, ..., 1.0280, -0.0574, 0.1352],
[ 0.9930, 1.1155, 0.8704, ..., 0.7479, 0.3102, 0.3803]],
[[-1.2641, -1.2293, -1.1247, ..., -0.9504, -1.1596, -1.2293],
[-0.9330, -1.0027, -1.0376, ..., -0.9678, -1.2641, -1.1421],
[-0.5321, -0.8633, -1.0898, ..., -1.1421, -1.2990, -0.9853],
...,
[ 0.9319, 0.5659, 0.1651, ..., 1.1934, 0.9145, 0.8797],
[ 1.0539, 0.7751, 0.8797, ..., 0.8971, -0.1312, 0.0605],
[ 1.0365, 1.0017, 0.6356, ..., 0.5834, 0.1999, 0.2871]]],
[[[-1.4843, -1.3473, -1.4329, ..., -0.9020, -0.8678, -0.8507],
[-1.6898, -1.6555, -1.4843, ..., -0.8507, -0.8507, -0.8507],
[-1.4500, -1.6898, -1.3987, ..., -0.8507, -0.8678, -0.8849],
...,
[-0.8849, -0.7308, -0.4911, ..., 1.8208, 1.8722, 1.8722],
[-1.2274, -1.0219, -0.6109, ..., 1.8550, 1.9064, 1.9064],
[-1.7069, -1.4843, -1.1418, ..., 1.8379, 1.9235, 1.9578]],
[[-1.2829, -1.1779, -1.3004, ..., 0.2752, 0.2577, 0.2402],
[-1.4755, -1.4930, -1.3529, ..., 0.2577, 0.2752, 0.2752],
[-1.2829, -1.5280, -1.2654, ..., 0.2927, 0.2927, 0.2927],
...,
[-0.7752, -0.6176, -0.3550, ..., 1.1681, 1.2381, 1.2556],
[-1.1429, -0.9328, -0.4951, ..., 1.2031, 1.2906, 1.3081],
[-1.6331, -1.4055, -1.0378, ..., 1.2031, 1.3081, 1.3606]],
[[-1.3164, -1.1944, -1.2641, ..., 1.7511, 1.7511, 1.7685],
[-1.5256, -1.5256, -1.3513, ..., 1.7685, 1.7337, 1.6814],
[-1.2990, -1.5430, -1.2641, ..., 1.7337, 1.7511, 1.7511],
...,
[-0.6890, -0.5321, -0.2881, ..., 0.8971, 0.9668, 0.9668],
[-1.0027, -0.8110, -0.4101, ..., 0.9319, 1.0017, 1.0191],
[-1.4733, -1.2816, -0.9504, ..., 0.9145, 1.0191, 1.0714]]],
[[[-1.6042, -1.6213, -1.5870, ..., -0.1486, -0.1314, 0.0056],
[-1.5699, -1.5528, -1.5699, ..., -0.1314, -0.1143, 0.0569],
[-1.5870, -1.5185, -1.4843, ..., -0.1143, -0.0629, 0.1597],
...,
[ 0.9132, 1.1187, 1.3413, ..., -0.7822, -0.7822, -0.7650],
[ 1.4440, 1.0844, 1.3242, ..., -0.7993, -0.7650, -0.7479],
[ 1.3755, 0.8961, 1.3927, ..., -0.8335, -0.7993, -0.7993]],
[[-1.5980, -1.6506, -1.6506, ..., -0.0224, -0.0049, 0.1176],
[-1.5980, -1.6155, -1.6331, ..., -0.0224, -0.0049, 0.1527],
[-1.6506, -1.5980, -1.5805, ..., -0.0399, 0.0126, 0.2402],
...,
[ 0.4853, 0.7129, 0.9580, ..., -0.7577, -0.7402, -0.7227],
[ 1.0280, 0.6604, 0.9230, ..., -0.7752, -0.7402, -0.7052],
[ 0.9405, 0.4503, 1.0105, ..., -0.8102, -0.7752, -0.7577]],
[[-1.5256, -1.5604, -1.5430, ..., 0.0779, 0.1128, 0.2522],
[-1.4733, -1.4733, -1.4907, ..., 0.0779, 0.1302, 0.2871],
[-1.4733, -1.4210, -1.3861, ..., 0.0779, 0.1476, 0.3742],
...,
[ 0.1476, 0.3393, 0.5834, ..., -0.6541, -0.6715, -0.6541],
[ 0.7054, 0.3393, 0.6008, ..., -0.6715, -0.6367, -0.6367],
[ 0.6705, 0.1651, 0.7228, ..., -0.6890, -0.6541, -0.6890]]]]), 'pixel_mask': tensor([[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]],
[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]],
[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]],
...,
[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]],
[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]],
[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]]]), 'labels': [{'size': tensor([640, 480]), 'image_id': tensor([69]), 'class_labels': tensor([5, 0, 1, 4, 4, 4, 4, 4]), 'boxes': tensor([[0.4675, 0.5152, 0.1846, 0.2045],
[0.5092, 0.5843, 0.3970, 0.3951],
[0.2719, 0.5861, 0.3738, 0.2471],
[0.1023, 0.6896, 0.2019, 0.1655],
[0.3902, 0.0924, 0.1530, 0.0898],
[0.5345, 0.0871, 0.0252, 0.0556],
[0.6370, 0.0877, 0.1357, 0.0899],
[0.9383, 0.0634, 0.0789, 0.0627]]), 'area': tensor([11597.7402, 48180.5664, 28372.1094, 10266.5547, 4223.3750, 430.7600,
3749.3826, 1517.7850]), 'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([1027]), 'class_labels': tensor([5, 4, 1, 0, 0]), 'boxes': tensor([[0.4669, 0.5782, 0.1456, 0.1290],
[0.5031, 0.6013, 0.0410, 0.0237],
[0.5269, 0.6380, 0.1138, 0.1280],
[0.3863, 0.5047, 0.4801, 0.3840],
[0.1074, 0.4195, 0.2101, 0.3353]]), 'area': tensor([ 5770.2451, 298.4550, 4471.7402, 56633.0859, 21642.4102]), 'iscrowd': tensor([0, 0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([1092]), 'class_labels': tensor([2, 5, 1, 0]), 'boxes': tensor([[0.1943, 0.1126, 0.1849, 0.0794],
[0.5387, 0.5818, 0.3646, 0.2689],
[0.3515, 0.7725, 0.3171, 0.2903],
[0.5404, 0.4307, 0.6236, 0.4566]]), 'area': tensor([ 4508.5000, 30117.5000, 28278.7598, 87485.0391]), 'iscrowd': tensor([0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([228]), 'class_labels': tensor([0]), 'boxes': tensor([[0.5187, 0.5418, 0.4982, 0.5698]]), 'area': tensor([87218.0078]), 'iscrowd': tensor([0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([511]), 'class_labels': tensor([5, 1]), 'boxes': tensor([[0.5284, 0.5886, 0.2903, 0.3347],
[0.7784, 0.7873, 0.4400, 0.4222]]), 'area': tensor([29848.7695, 57066.2383]), 'iscrowd': tensor([0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([338]), 'class_labels': tensor([5, 0, 1]), 'boxes': tensor([[0.4990, 0.5424, 0.2227, 0.1716],
[0.5455, 0.5335, 0.3754, 0.3595],
[0.7111, 0.6979, 0.3313, 0.2838]]), 'area': tensor([11742.9648, 41455.0117, 28882.3496]), 'iscrowd': tensor([0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([405]), 'class_labels': tensor([0, 1, 5]), 'boxes': tensor([[0.4952, 0.6559, 0.6088, 0.4872],
[0.2074, 0.7760, 0.4117, 0.4459],
[0.4132, 0.5714, 0.0663, 0.0580]]), 'area': tensor([91107.9609, 56385.1602, 1179.7800]), 'iscrowd': tensor([0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([3]), 'class_labels': tensor([0, 5, 1, 4, 4, 4]), 'boxes': tensor([[0.5020, 0.4466, 0.6579, 0.5829],
[0.5148, 0.5684, 0.2288, 0.1367],
[0.7040, 0.7836, 0.4468, 0.4219],
[0.3160, 0.8416, 0.3991, 0.2993],
[0.4095, 0.0661, 0.0888, 0.0666],
[0.7489, 0.1356, 0.3843, 0.2637]]), 'area': tensor([117809.1875, 9607.5000, 57901.5000, 36691.4023, 1814.7600,
31125.9375]), 'iscrowd': tensor([0, 0, 0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([182]), 'class_labels': tensor([0, 1, 5]), 'boxes': tensor([[0.5786, 0.5016, 0.5992, 0.4539],
[0.6307, 0.7197, 0.4165, 0.3323],
[0.4415, 0.6429, 0.1546, 0.2070]]), 'area': tensor([83547.7969, 42508.7344, 9827.7900]), 'iscrowd': tensor([0, 0, 0]), 'orig_size': tensor([1280, 960])}, {'size': tensor([640, 480]), 'image_id': tensor([640]), 'class_labels': tensor([5, 1, 0]), 'boxes': tensor([[0.5314, 0.6391, 0.2920, 0.4553],
[0.7088, 0.7733, 0.5596, 0.4422],
[0.5282, 0.5060, 0.5678, 0.4612]]), 'area': tensor([40839.7109, 76013.7969, 80443.1328]), 'iscrowd': tensor([0, 0, 0]), 'orig_size': tensor([1280, 960])}]}
# Images are reshaped to be the IMAGE_SIZE value that we set
"train"][0]["pixel_values"].shape processed_dataset[
torch.Size([3, 640, 480])
13.3 TK - Create a collation function
Notes:
- The input to the data_collator function will be the output of image_processor (see below for the format).
- The output of the data_collator will be passed to our model’s forward() method.
- From the transformers.Trainer documentation for data_collator (https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.data_collator): “The function to use to form a batch from a list of elements of train_dataset or eval_dataset.”

The input to data_collator is the output of image_processor:
{'pixel_values': tensor([[[ 2.2318, 2.2318, 2.2318, ..., 0.3309, 0.2282, 0.1254],
[ 2.2318, 2.2318, 2.2318, ..., 0.3138, 0.2111, 0.1426],
[ 2.2318, 2.2318, 2.2489, ..., 0.2967, 0.2111, 0.1426],
...,
[-0.8164, -0.8164, -0.7993, ..., 0.5878, 0.5707, 0.5878],
[-0.9363, -0.8849, -0.8164, ..., 0.5193, 0.5364, 0.5707],
[-0.9877, -0.9363, -0.9192, ..., 0.5707, 0.5707, 0.5878]],
[[ 2.4286, 2.4286, 2.4286, ..., 0.4853, 0.4153, 0.3277],
[ 2.4286, 2.4286, 2.4286, ..., 0.4853, 0.3978, 0.3277],
[ 2.4286, 2.4286, 2.4286, ..., 0.4678, 0.3803, 0.3102],
...,
[-1.1253, -1.1253, -1.1078, ..., 0.2052, 0.1877, 0.2052],
[-1.2129, -1.1604, -1.1253, ..., 0.1352, 0.1527, 0.1877],
[-1.2479, -1.2129, -1.2304, ..., 0.1877, 0.1877, 0.2052]],
[[ 2.6051, 2.6051, 2.6051, ..., 0.6531, 0.6008, 0.5311],
[ 2.6051, 2.6051, 2.6051, ..., 0.6531, 0.5659, 0.5136],
[ 2.6051, 2.6051, 2.6051, ..., 0.6356, 0.5485, 0.4788],
...,
[-1.3861, -1.3687, -1.3339, ..., -0.2358, -0.2532, -0.2358],
[-1.4907, -1.4210, -1.3513, ..., -0.3055, -0.2881, -0.2532],
[-1.5256, -1.4733, -1.4559, ..., -0.2532, -0.2532, -0.2358]]]),
'pixel_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]]),
'labels': {'size': tensor([1066, 800]), 'image_id': tensor([0]), 'class_labels': tensor([1, 0]), 'boxes': tensor([[0.7553, 0.5571, 0.4196, 0.2626],
[0.5022, 0.5583, 0.9827, 0.8609]]), 'area': tensor([ 93955.8828, 721446.3750]), 'iscrowd': tensor([0, 0]), 'orig_size': tensor([1280, 960])}}
The data_collator
function will turn collections of these into batches (e.g. stack together the pixel_values
, pixel_mask
, labels
etc).
# Create data_collate_function to collect samples into batches
# TK - want to get a dictionary of {"pixel_values": [batch_of_samples], "labels": [batch_of_samples], "pixel_mask": [batch_of_samples]}
def data_collate_function(batch):
    collated_data = {}

    # Stack together a collection of pixel_values tensors
    collated_data["pixel_values"] = torch.stack([sample["pixel_values"] for sample in batch])

    # Get the labels (these are dictionaries so no need to use torch.stack)
    collated_data["labels"] = [sample["labels"] for sample in batch]

    # If there is a pixel_mask key, return the pixel_mask's as well
    if "pixel_mask" in batch[0]:
        collated_data["pixel_mask"] = torch.stack([sample["pixel_mask"] for sample in batch])

    return collated_data
%%time
# Try data_collate_function
example_collated_data_batch = data_collate_function(processed_dataset["train"].select(range(32)))
example_collated_data_batch["pixel_values"].shape
CPU times: user 2.01 s, sys: 131 ms, total: 2.14 s
Wall time: 1.45 s
torch.Size([32, 3, 640, 480])
example_collated_data_batch.keys()
dict_keys(['pixel_values', 'labels', 'pixel_mask'])
# 32 samples (because that's our batch size)
len(example_collated_data_batch["pixel_values"]), len(example_collated_data_batch["labels"]), len(example_collated_data_batch["pixel_mask"])
(32, 32, 32)
TK - We get a batch of 32 samples of size (640, 480). These are all preprocessed and ready to be fed to our model.
%%time
# Try passing a batch through our model (note: this will be slow if our model is on the CPU)
example_batch_outputs = model(example_collated_data_batch["pixel_values"])
example_batch_outputs
CPU times: user 1min 4s, sys: 12.5 s, total: 1min 17s
Wall time: 5.21 s
ConditionalDetrObjectDetectionOutput(loss=None, loss_dict=None, logits=tensor([[[ 0.1756, 0.0112, -0.1084, ..., 0.1422, 0.0683, 0.1605],
[-0.2120, -0.2104, -0.1722, ..., 0.3864, -0.1778, 0.2019],
[ 0.1066, 0.1096, 0.2123, ..., 0.1867, -0.0547, 0.2594],
...,
[-0.3185, 0.3699, -0.2245, ..., 0.1371, 0.2279, 0.2639],
[ 0.0702, 0.0533, 0.1279, ..., 0.2358, -0.1269, 0.2406],
[-0.1309, -0.3195, 0.1867, ..., 0.4492, -0.0839, 0.4281]],
[[ 0.1036, 0.0428, -0.2660, ..., 0.0152, 0.0188, 0.0505],
[-0.1730, -0.3609, -0.0393, ..., 0.2778, -0.2219, 0.1670],
[ 0.0929, 0.2278, 0.2457, ..., 0.0409, -0.1385, 0.1913],
...,
[-0.0265, 0.0631, 0.0627, ..., 0.0372, -0.1568, 0.0072],
[ 0.0708, 0.1320, 0.1984, ..., 0.1450, -0.0370, 0.1971],
[-0.2185, -0.3554, 0.0250, ..., 0.1523, -0.1766, -0.2412]],
[[-0.0034, -0.1252, -0.4586, ..., 0.0920, -0.0194, 0.0565],
[-0.1779, -0.3050, -0.0245, ..., 0.1755, -0.2620, 0.3097],
[-0.0193, 0.0550, -0.0951, ..., -0.0771, 0.0046, 0.0384],
...,
[-0.2811, -0.0509, -0.0340, ..., 0.4088, -0.0885, 0.1977],
[-0.1411, -0.2114, -0.0364, ..., 0.1844, -0.2052, -0.1303],
[-0.0397, -0.3287, 0.0959, ..., 0.3857, -0.2455, 0.3551]],
...,
[[-0.2905, -0.1199, -0.5113, ..., 0.0797, 0.0761, -0.1454],
[-0.3391, -0.4398, 0.1613, ..., 0.3521, -0.2897, 0.4688],
[ 0.0515, 0.1871, 0.2654, ..., 0.0055, 0.0177, -0.2444],
...,
[-0.5897, 0.2452, -0.1715, ..., 0.1403, 0.2739, 0.2423],
[ 0.0082, 0.3222, 0.1669, ..., 0.0938, 0.1326, -0.1318],
[-0.1835, -0.0591, 0.1662, ..., 0.1506, -0.1369, -0.0960]],
[[ 0.0088, 0.0562, -0.1568, ..., 0.0956, 0.1420, -0.0164],
[-0.1252, -0.3315, -0.0670, ..., 0.3029, -0.3670, 0.2253],
[ 0.1418, 0.0832, 0.1878, ..., 0.2082, -0.2881, 0.0064],
...,
[-0.3357, 0.0241, -0.2351, ..., 0.1009, 0.2384, 0.1972],
[ 0.1632, 0.0212, 0.1528, ..., 0.2441, -0.2813, -0.1012],
[-0.2424, -0.3850, 0.1242, ..., 0.2214, -0.4294, -0.2708]],
[[ 0.0734, -0.0391, -0.4524, ..., 0.0742, -0.0376, -0.1117],
[ 0.0506, -0.0210, 0.0115, ..., 0.0043, -0.1665, -0.0796],
[ 0.0133, -0.2106, -0.0142, ..., 0.5130, -0.2083, 0.1878],
...,
[-0.2460, -0.1284, -0.1073, ..., 0.2888, -0.2080, 0.0897],
[-0.1026, -0.2328, -0.1268, ..., 0.4177, -0.3034, 0.1005],
[-0.2828, -0.4220, 0.1543, ..., 0.3707, -0.5253, -0.1016]]],
grad_fn=<ViewBackward0>), pred_boxes=tensor([[[0.9500, 0.6381, 0.1323, 0.6838],
[0.6333, 0.0871, 0.1233, 0.0674],
[0.9906, 0.3960, 0.0203, 0.1109],
...,
[0.3539, 0.4133, 0.7001, 0.7790],
[0.9606, 0.3789, 0.0489, 0.0365],
[0.0161, 0.1030, 0.0344, 0.0579]],
[[0.7669, 0.9339, 0.5311, 0.1394],
[0.6464, 0.0556, 0.1167, 0.1030],
[0.9931, 0.5555, 0.0139, 0.1452],
...,
[0.3557, 0.4036, 0.2504, 0.1227],
[0.9974, 0.1280, 0.0060, 0.2545],
[0.0663, 0.3102, 0.1351, 0.1014]],
[[0.7914, 0.7499, 0.3953, 0.4964],
[0.6263, 0.0586, 0.2221, 0.0999],
[0.8788, 0.5819, 0.2370, 0.4087],
...,
[0.5177, 0.3094, 0.5866, 0.2528],
[0.8649, 0.4862, 0.2535, 0.2284],
[0.0075, 0.1096, 0.0162, 0.0404]],
...,
[[0.6732, 0.8154, 0.6100, 0.3546],
[0.6279, 0.0335, 0.0599, 0.0650],
[0.9707, 0.6607, 0.0617, 0.3039],
...,
[0.4202, 0.3927, 0.8342, 0.4869],
[0.9947, 0.7069, 0.0112, 0.4913],
[0.0305, 0.3756, 0.0607, 0.2126]],
[[0.8158, 0.7939, 0.3493, 0.3947],
[0.6333, 0.0566, 0.1713, 0.1139],
[0.9268, 0.5027, 0.1466, 0.1161],
...,
[0.3697, 0.3443, 0.7456, 0.7034],
[0.8904, 0.4677, 0.2055, 0.1397],
[0.0306, 0.3043, 0.0608, 0.0623]],
[[0.7667, 0.7876, 0.4440, 0.4146],
[0.6509, 0.2361, 0.0934, 0.0602],
[0.9449, 0.3689, 0.0946, 0.0297],
...,
[0.4404, 0.3215, 0.4841, 0.1150],
[0.8951, 0.3655, 0.1890, 0.0374],
[0.0411, 0.2678, 0.0848, 0.0570]]], grad_fn=<SigmoidBackward0>), auxiliary_outputs=None, last_hidden_state=tensor([[[-1.4020e-01, -1.5893e-01, 4.4403e-01, ..., 1.5252e-01,
2.8576e-01, 2.6249e-01],
[ 6.8369e-02, -2.7463e-01, -4.5402e-01, ..., -9.0982e-01,
-4.7036e-01, 7.2642e-01],
[-1.7512e-01, 3.1511e-01, 2.2512e-01, ..., -1.4200e-01,
2.5577e-01, 4.1778e-01],
...,
[ 1.5953e-01, -4.0302e-01, 2.2796e-01, ..., -8.1157e-01,
-3.6345e-01, -8.9928e-02],
[ 4.0930e-02, 6.6010e-04, 1.2503e-01, ..., -5.6554e-02,
3.2782e-01, 3.9761e-01],
[-1.5904e-02, 5.8626e-01, -1.3788e-01, ..., -7.4208e-01,
-1.3682e-01, 1.0417e-01]],
[[ 1.8487e-01, -2.6388e-01, 7.6519e-01, ..., -4.4617e-01,
1.6003e-01, 5.6029e-01],
[ 5.1641e-01, -5.4275e-02, 1.0399e+00, ..., -8.5620e-01,
-2.2614e-01, -2.9099e-01],
[-2.0582e-01, 2.9136e-01, 2.8441e-01, ..., -4.6227e-02,
2.9668e-01, 7.5241e-01],
...,
[ 3.8438e-01, 6.9957e-01, -5.8716e-01, ..., -9.2270e-01,
-4.5221e-02, -1.3225e-01],
[-6.4926e-02, 1.9942e-01, 4.3592e-01, ..., 3.0664e-02,
5.1831e-01, 3.6161e-01],
[ 4.8070e-01, -5.2024e-01, 2.0143e-01, ..., -1.5431e+00,
-3.6578e-01, -2.4390e-01]],
[[ 2.7832e-01, 7.0842e-02, 1.2050e+00, ..., -7.3184e-01,
1.7189e-01, 3.8562e-02],
[ 7.5524e-01, 1.0498e-01, 5.4896e-01, ..., -4.7316e-01,
7.8752e-03, 2.6307e-01],
[-3.7225e-01, 5.2872e-02, 5.6387e-01, ..., -1.3147e+00,
2.3460e-01, 4.7530e-01],
...,
[ 3.1748e-01, -1.0066e+00, 3.5116e-01, ..., -8.5966e-01,
-1.8258e-01, 2.6463e-01],
[-3.0530e-02, -1.0162e+00, 4.3357e-01, ..., -1.1250e+00,
-1.9363e-01, -7.9971e-02],
[ 3.0213e-01, 1.3661e-01, -6.4669e-01, ..., -5.1888e-01,
-6.2747e-02, 6.2570e-01]],
...,
[[ 3.9975e-01, -9.5206e-01, 8.8087e-01, ..., -8.3797e-01,
-3.4231e-02, 1.5127e-02],
[ 3.7874e-01, 4.6002e-01, 5.5632e-01, ..., -8.4079e-01,
3.5074e-01, -1.0479e-01],
[-2.1702e-01, -6.3238e-01, 3.0843e-01, ..., -4.9595e-01,
3.9976e-01, 7.5963e-01],
...,
[-9.0952e-02, -1.8212e+00, -7.9186e-02, ..., -1.0548e+00,
-7.6392e-02, 3.0424e-01],
[-5.6228e-02, -5.4257e-01, 3.7607e-01, ..., -1.8365e-01,
7.9351e-01, 1.0800e+00],
[ 8.0718e-02, -3.2467e-01, 3.0199e-02, ..., -1.0819e+00,
1.6267e-01, 4.1212e-01]],
[[ 4.9446e-01, -3.8678e-01, 9.7415e-01, ..., -9.0278e-01,
9.9647e-03, 4.2870e-02],
[ 7.2289e-01, 2.6472e-01, 6.9674e-01, ..., -8.4964e-01,
-3.5554e-01, -4.0242e-01],
[ 2.0905e-01, 1.7493e-01, 7.1425e-01, ..., -6.0879e-01,
-2.6598e-01, 5.8427e-01],
...,
[ 3.1929e-01, -1.3318e+00, 1.0949e+00, ..., -1.0937e+00,
-4.9580e-01, -4.8511e-01],
[ 2.8816e-01, 1.6738e-04, 1.1606e+00, ..., -7.3686e-01,
-2.4679e-01, 1.9954e-01],
[ 1.8261e-01, -1.2720e-02, -3.0613e-01, ..., -6.9232e-01,
-2.6717e-01, 1.7242e-01]],
[[ 1.1384e-01, 1.4387e-01, 3.6687e-02, ..., -7.7477e-01,
1.0376e-01, -2.5709e-01],
[ 3.4558e-01, -4.4018e-01, 3.6415e-01, ..., 1.7454e-01,
2.4093e-01, -4.9051e-02],
[ 1.7516e-01, -2.2057e-01, -1.2419e-01, ..., -1.5287e-01,
6.2450e-02, 4.9240e-02],
...,
[ 6.6910e-01, -3.4297e-01, -2.0511e-01, ..., -1.0155e+00,
7.9812e-03, 3.0636e-01],
[ 4.0032e-01, -3.4343e-01, 1.5294e-01, ..., -3.3256e-01,
-2.5672e-01, -1.9711e-01],
[-1.1014e-01, -4.8125e-01, 1.0338e-01, ..., -7.0084e-01,
4.9208e-03, 2.7278e-01]]], grad_fn=<NativeLayerNormBackward0>), decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[-0.3462, 0.1944, -0.1375, ..., -0.4447, 0.4016, 0.4290],
[ 0.0648, 0.2144, 0.0340, ..., 0.2365, 0.1294, 0.3575],
[ 0.1515, 0.5005, -0.0685, ..., -0.0157, 0.1598, 0.3866],
...,
[ 0.1488, 0.8020, -0.2199, ..., 0.2656, 0.0879, 0.2309],
[ 0.1548, 0.6870, -0.1847, ..., 0.3029, 0.0465, 0.2264],
[-0.1375, 0.4506, -0.2336, ..., -0.0616, 0.1774, 0.2659]],
[[-0.3369, 0.3608, -0.2942, ..., -0.4818, 0.4762, 0.3779],
[ 0.0714, 0.3084, 0.0148, ..., 0.0797, 0.2380, 0.3244],
[ 0.0873, 0.4330, -0.0352, ..., -0.2179, 0.2011, 0.2788],
...,
[-0.0706, 0.0146, 0.1921, ..., -0.1177, -0.1456, 0.0187],
[ 0.1120, 0.2591, 0.0263, ..., 0.1479, -0.0880, 0.0873],
[-0.1774, 0.3163, -0.0410, ..., 0.0425, 0.1321, 0.2753]],
[[-0.2815, 0.3443, -0.2270, ..., -0.5475, 0.2527, 0.3086],
[ 0.1719, 0.4588, -0.0811, ..., 0.0694, 0.0811, 0.3715],
[ 0.2389, 0.2392, -0.1076, ..., -0.1341, -0.2286, 0.2902],
...,
[ 0.2274, 0.4766, 0.0128, ..., 0.2001, 0.2571, 0.2773],
[ 0.2339, 0.5257, 0.0034, ..., 0.2795, 0.2356, 0.2127],
[-0.0985, 0.3517, -0.0659, ..., -0.0961, 0.3029, 0.1836]],
...,
[[-0.3820, 0.4122, -0.4279, ..., -0.4390, 0.4537, 0.3619],
[ 0.0776, 0.4093, -0.1319, ..., 0.3167, 0.1865, 0.4449],
[ 0.0644, 0.5139, -0.1786, ..., 0.1034, 0.1915, 0.3504],
...,
[-0.0715, 0.1232, 0.0057, ..., 0.2714, 0.0190, 0.1771],
[ 0.1267, 0.3740, 0.0213, ..., -0.0367, 0.0245, 0.2749],
[-0.1652, 0.1528, 0.1033, ..., -0.1985, 0.0891, 0.3079]],
[[-0.2655, 0.2723, -0.2191, ..., -0.3646, 0.3872, 0.2680],
[ 0.1672, 0.2333, -0.0337, ..., 0.2537, 0.2663, 0.3487],
[ 0.1631, 0.3007, -0.1148, ..., 0.1061, 0.1698, 0.2983],
...,
[ 0.1221, 0.1708, 0.0071, ..., 0.4499, -0.0821, 0.0854],
[ 0.1202, 0.0732, -0.0148, ..., 0.6552, -0.2320, 0.0461],
[-0.0094, 0.2407, 0.1013, ..., -0.1772, -0.1296, 0.0011]],
[[-0.3589, 0.4908, -0.3906, ..., -0.5620, 0.4539, 0.2588],
[ 0.1310, 0.5131, -0.0584, ..., 0.1296, 0.1215, 0.2423],
[ 0.1021, 0.6150, -0.0859, ..., -0.0818, 0.1724, 0.2820],
...,
[ 0.2026, 0.4986, 0.1082, ..., 0.1570, 0.1229, 0.1716],
[ 0.1716, 0.3375, 0.1374, ..., 0.4551, 0.0419, 0.0987],
[-0.0736, 0.2892, 0.0910, ..., -0.2655, 0.1247, 0.0657]]],
grad_fn=<NativeLayerNormBackward0>), encoder_hidden_states=None, encoder_attentions=None)
example_batch_outputs.keys()
odict_keys(['logits', 'pred_boxes', 'last_hidden_state', 'encoder_last_hidden_state'])
# We get 300 predictions per image in our batch, each with a logit value for each of the classes in our dataset
example_batch_outputs.logits.shape
torch.Size([32, 300, 7])
This is what will happen during training: our model will continually go over batches of data and try to match its own predictions to the ground truth labels.
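Before setting up the Trainer, we can sanity check that the model returns a loss when we pass it labels. The following is a rough sketch of a single training step (assuming the model and the batch are on the same device), not the Trainer's actual training loop.

# A rough sketch of one training step: forward pass with labels so the model computes a loss
small_batch = data_collate_function(processed_dataset["train"].select(range(2))) # tiny batch to keep it quick

small_batch_outputs = model(pixel_values=small_batch["pixel_values"],
                            pixel_mask=small_batch["pixel_mask"],
                            labels=small_batch["labels"]) # passing labels makes the model return a loss

print(f"[INFO] Loss on this small batch: {small_batch_outputs.loss.item():.4f}")
# During training, the Trainer repeatedly computes a loss like this, calls backward() and updates the model's weights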
14 TK - Setup TrainingArguments + Trainer
UPTOHERE - creating TrainingArguments + Trainer + Training a model
- TK - for hyperparameters, see example in RT-DETR paper: https://arxiv.org/pdf/2304.08069
- As well as DETR - https://arxiv.org/pdf/2005.12872 (see Appendix A.4)
- Try training for 25 epochs and see what happens
"validation"][0] processed_dataset[
{'pixel_values': tensor([[[ 0.1254, 0.1254, 0.1597, ..., -2.0837, -1.9809, -1.9295],
[ 0.1426, 0.1254, 0.1597, ..., -2.0494, -1.9638, -1.9467],
[ 0.1426, 0.1426, 0.1597, ..., -1.9467, -1.9295, -1.9467],
...,
[ 1.2899, 1.0502, 1.1358, ..., 0.7248, 0.7933, 0.7762],
[ 1.4098, 1.1872, 1.0331, ..., 0.7077, 0.7419, 0.7419],
[ 1.2728, 0.9646, 0.9303, ..., 0.7077, 0.7591, 0.7248]],
[[ 1.2206, 1.1856, 1.1506, ..., -1.9832, -1.8782, -1.7731],
[ 1.2381, 1.1856, 1.1506, ..., -1.9657, -1.8606, -1.8256],
[ 1.2381, 1.2031, 1.1681, ..., -1.8606, -1.8256, -1.8431],
...,
[ 1.2906, 1.0630, 1.1506, ..., 0.3803, 0.4503, 0.4328],
[ 1.4307, 1.2031, 1.0280, ..., 0.3627, 0.3978, 0.3978],
[ 1.2906, 0.9755, 0.9230, ..., 0.3627, 0.4153, 0.3803]],
[[ 2.1346, 2.2217, 2.1868, ..., -1.7173, -1.6127, -1.5604],
[ 2.1520, 2.2217, 2.1868, ..., -1.6999, -1.5953, -1.5779],
[ 2.1694, 2.2217, 2.1868, ..., -1.5953, -1.5430, -1.5604],
...,
[ 1.2108, 0.9842, 1.0539, ..., 0.3568, 0.4265, 0.4091],
[ 1.3154, 1.0888, 0.9494, ..., 0.3393, 0.3742, 0.3742],
[ 1.1759, 0.8622, 0.8448, ..., 0.3393, 0.3916, 0.3568]]]),
'pixel_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]]),
'labels': {'size': tensor([640, 480]), 'image_id': tensor([719]), 'class_labels': tensor([4, 4, 1, 5, 0, 0]), 'boxes': tensor([[0.1898, 0.1767, 0.2161, 0.1620],
[0.5669, 0.1938, 0.0742, 0.0805],
[0.7672, 0.7768, 0.4526, 0.4327],
[0.4715, 0.6213, 0.2235, 0.1502],
[0.3973, 0.5639, 0.7729, 0.6337],
[0.6906, 0.4581, 0.5110, 0.4600]]), 'area': tensor([ 10753.6875, 1833.4000, 60167.3867, 10316.8945, 150459.0469,
72216.3203]), 'iscrowd': tensor([0, 0, 0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}}
# Note: Depending on the size/speed of your GPU, this may take a while
from transformers import TrainingArguments, Trainer
# Set the batch size according to the memory you have available on your GPU
# e.g. on my NVIDIA RTX 4090 with 24GB of VRAM, I can use a batch size of 32 without running out of memory
BATCH_SIZE = 16

# Note: AdamW Optimizer is used by default
training_args = TrainingArguments(
    output_dir="detr_finetuned_trashify_box_detector",
    num_train_epochs=25,
    fp16=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    weight_decay=1e-4,
    max_grad_norm=0.01,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=2,
    remove_unused_columns=False,
    report_to="none", # don't save experiments to a third party service
    dataloader_num_workers=4, # note: if you're on Google Colab, you may have to lower this to os.cpu_count() or to 0
    warmup_ratio=0.05, # learning rate warmup
    push_to_hub=False,
    eval_do_concat_batches=False
)
model_v1_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    tokenizer=image_processor,
    data_collator=data_collate_function,
    # compute_metrics=None # TODO: TK - can add a metrics function, just see if model trains first, see here for an example: https://github.com/huggingface/transformers/blob/336dc69d63d56f232a183a3e7f52790429b871ef/examples/pytorch/object-detection/run_object_detection.py#L160
)

model_v1_results = model_v1_trainer.train()
/home/daniel/miniconda3/envs/ai/lib/python3.11/site-packages/accelerate/accelerator.py:488: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 101.878300 | 7.513162 |
2 | 4.145500 | 3.055572 |
3 | 2.596400 | 2.273679 |
4 | 2.277300 | 2.069138 |
5 | 2.081800 | 1.849403 |
6 | 1.925300 | 1.687234 |
7 | 1.780200 | 1.603322 |
8 | 1.675000 | 1.451112 |
9 | 1.526300 | 1.409718 |
10 | 1.432200 | 1.339651 |
11 | 1.386000 | 1.289711 |
12 | 1.309800 | 1.281332 |
13 | 1.248000 | 1.209565 |
14 | 1.209000 | 1.220024 |
15 | 1.175700 | 1.198685 |
16 | 1.144000 | 1.175700 |
17 | 1.073200 | 1.193522 |
18 | 1.050100 | 1.153087 |
19 | 0.986400 | 1.157631 |
20 | 0.994100 | 1.151300 |
21 | 0.958900 | 1.144987 |
22 | 0.927900 | 1.135496 |
23 | 0.907100 | 1.123257 |
24 | 0.885100 | 1.133819 |
25 | 0.870900 | 1.130216 |
TK - Note: You may get an error at the beginning of training where a box is predicted with a negative value. This will break training, as box coordinates are expected to be positive floats.
15 TK - Make predictions on the test dataset
"test"][0] processed_dataset[
{'pixel_values': tensor([[[-0.9705, -0.7308, -0.9705, ..., -1.8953, -1.8268, -1.3130],
[-1.2959, -0.9363, -0.3883, ..., -1.8953, -1.7240, -0.5596],
[-1.4843, -1.1418, -0.1999, ..., -1.8782, -1.2788, -0.5424],
...,
[ 1.3242, 1.3242, 1.3413, ..., -0.6452, -0.2856, -0.9877],
[ 1.3070, 1.3584, 1.4098, ..., -0.8678, 0.0398, -0.4911],
[ 1.2728, 1.3413, 1.4098, ..., -0.9705, 0.1768, -0.1657]],
[[-0.5476, -0.3550, -0.6527, ..., -1.7031, -1.6155, -1.0903],
[-0.8803, -0.5476, -0.0399, ..., -1.6856, -1.5280, -0.3200],
[-1.0728, -0.7402, 0.1527, ..., -1.6506, -1.0553, -0.3025],
...,
[-1.7031, -1.7031, -1.6856, ..., -0.3901, 0.0301, -0.7227],
[-1.7206, -1.6681, -1.6155, ..., -0.6176, 0.3803, -0.1800],
[-1.7556, -1.6856, -1.6155, ..., -0.7052, 0.5203, 0.1527]],
[[-1.0550, -0.7064, -0.8284, ..., -1.6824, -1.5953, -1.0201],
[-1.3861, -0.9504, -0.2881, ..., -1.6999, -1.4559, -0.2532],
[-1.6476, -1.1944, -0.1661, ..., -1.6650, -1.0376, -0.2881],
...,
[-1.2641, -1.2641, -1.2467, ..., -0.9504, -0.8284, -1.2293],
[-1.2816, -1.2293, -1.1770, ..., -1.1596, -0.5321, -1.0027],
[-1.3164, -1.2467, -1.1770, ..., -1.3513, -0.4973, -0.8981]]]),
'pixel_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]]),
'labels': {'size': tensor([640, 480]), 'image_id': tensor([61]), 'class_labels': tensor([4, 5, 1, 0]), 'boxes': tensor([[0.2104, 0.8563, 0.2855, 0.2720],
[0.4194, 0.4927, 0.2398, 0.1785],
[0.3610, 0.6227, 0.2706, 0.2330],
[0.4974, 0.4785, 0.3829, 0.3820]]), 'area': tensor([23860.4043, 13150.1748, 19368.0898, 44929.9102]), 'iscrowd': tensor([0, 0, 0, 0]), 'orig_size': tensor([1280, 960])}}
# Make predictions with trainer containing trained model
test_dataset_preds = model_v1_trainer.predict(test_dataset=processed_dataset["test"])
# test_dataset_preds
# Get the logits
test_pred_logits = test_dataset_preds.predictions[0][1]

# Get the boxes
test_pred_boxes = test_dataset_preds.predictions[0][2]

# Get the label IDs
test_pred_label_ids = test_dataset_preds.label_ids

# Check shapes
test_pred_logits.shape, test_pred_boxes.shape, len(test_pred_label_ids)
((16, 300, 7), (16, 300, 4), 13)
%%time
# Get a random sample from the test preds
random_test_pred_index = random.randint(0, len(processed_dataset["test"]) - 1) # -1 since randint includes both endpoints
print(f"[INFO] Making predictions on test item with index: {random_test_pred_index}")

random_test_sample = processed_dataset["test"][random_test_pred_index]

# Do a single forward pass with the model
random_test_sample_outputs = model(pixel_values=random_test_sample["pixel_values"].unsqueeze(0).to("cuda"), # model expects input [batch_size, color_channels, height, width]
                                   pixel_mask=None)
# random_test_sample_outputs
[INFO] Making predictions on test item with index: 163
CPU times: user 51.5 ms, sys: 10.3 ms, total: 61.8 ms
Wall time: 63.1 ms
# image_processor.preprocess?
# Get a random sample from the test preds
random_test_pred_index = random.randint(0, len(processed_dataset["test"]) - 1)
print(f"[INFO] Making predictions on test item with index: {random_test_pred_index}")
random_test_sample = processed_dataset["test"][random_test_pred_index]

# Do a single forward pass with the model
random_test_sample_outputs = model(pixel_values=random_test_sample["pixel_values"].unsqueeze(0).to("cuda"), # model expects input [batch_size, color_channels, height, width]
                                   pixel_mask=None)
# Post process a random item from test preds
random_test_sample_outputs_post_processed = image_processor.post_process_object_detection(
    outputs=random_test_sample_outputs,
    threshold=0.25, # prediction probability threshold for keeping boxes
    target_sizes=[random_test_sample["labels"]["orig_size"]] # original input image size (or whichever target size you'd like), must have the same number of items as the input list
)
# Plot the random sample test preds
# Extract scores, labels and boxes
= random_test_sample_outputs_post_processed[0]["scores"]
random_test_sample_pred_scores = random_test_sample_outputs_post_processed[0]["labels"]
random_test_sample_pred_labels = random_test_sample_outputs_post_processed[0]["boxes"]
random_test_sample_pred_boxes
# Create a list of labels to plot on the boxes
= [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
random_test_sample_labels_to_plot for label_pred, score_pred in zip(random_test_sample_pred_labels, random_test_sample_pred_scores)]
print(f"[INFO] Labels with scores: {random_test_sample_labels_to_plot}")
# Plot the predicted boxes on the random test image
to_pil_image(=draw_bounding_boxes(
pic=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
image=random_test_sample_pred_boxes,
boxes=random_test_sample_labels_to_plot,
labels=3
width
) )
[INFO] Making predictions on test item with index: 28
[INFO] Labels with scores: ['Pred: hand (0.4208)', 'Pred: trash (0.3352)']
TK - Nice! These boxes look far better than the randomly predicted boxes we got from the untrained model earlier (a quick IoU check is sketched below).
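To go beyond eyeballing, we could also quantify how well the predicted boxes line up with the ground truth boxes for the same sample using IoU (Intersection over Union). The following is a minimal optional sketch that reuses the prediction variables from the cells above:

import torch
from torchvision.ops import box_convert, box_iou

# Ground truth boxes come out of the image processor as normalized (center_x, center_y, width, height),
# so scale them by the original image size and convert to xyxy to match the post-processed predictions
orig_height, orig_width = random_test_sample["labels"]["orig_size"].tolist()
ground_truth_boxes_xyxy = box_convert(
    boxes=random_test_sample["labels"]["boxes"] * torch.tensor([orig_width, orig_height, orig_width, orig_height]),
    in_fmt="cxcywh",
    out_fmt="xyxy"
)

# IoU matrix: one row per predicted box, one column per ground truth box
print(box_iou(random_test_sample_pred_boxes.cpu(), ground_truth_boxes_xyxy))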
15.1 TK - Predict on image from filepath
# Pred on image from pathname
from pathlib import Path
from PIL import Image
= Path("data/trashify_test_images")
path_to_test_image_folder = list(path_to_test_image_folder.rglob("*.jp*g"))
test_image_filepaths = random.choice(test_image_filepaths)
test_image_targ_filepath # test_image_targ_filepath = "data/trashify_test_images/IMG_6692.jpeg"
= Image.open(test_image_targ_filepath)
test_image_pil = image_processor.preprocess(images=test_image_pil,
test_image_preprocessed ="pt")
return_tensors
def get_image_dimensions_from_pil(image: Image.Image) -> torch.Tensor:
    """
    Convert the dimensions of a PIL image to a PyTorch tensor in the order (height, width).

    Args:
        image (Image.Image): The input PIL image.

    Returns:
        torch.Tensor: A tensor containing the height and width of the image.
    """
    # Get (width, height) of image (PIL.Image.size returns width, height)
    width, height = image.size

    # Convert to a tensor in the order (height, width)
    image_dimensions_tensor = torch.tensor([height, width])

    return image_dimensions_tensor
# Get image original size
test_image_size = get_image_dimensions_from_pil(image=test_image_pil)

# Make predictions on the preprocessed image
random_test_sample_outputs = model(pixel_values=test_image_preprocessed["pixel_values"].to("cuda"), # model expects input [batch_size, color_channels, height, width]
                                   pixel_mask=None)
THRESHOLD = 0.2

# Post process the predictions
random_test_sample_outputs_post_processed = image_processor.post_process_object_detection(
    outputs=random_test_sample_outputs,
    threshold=THRESHOLD,
    target_sizes=[test_image_size] # needs to be the same length as the batch dimension of the logits (e.g. [[height, width]])
)
# Extract scores, labels and boxes
random_test_sample_pred_scores = random_test_sample_outputs_post_processed[0]["scores"]
random_test_sample_pred_labels = random_test_sample_outputs_post_processed[0]["labels"]
random_test_sample_pred_boxes = random_test_sample_outputs_post_processed[0]["boxes"]

# Create a list of labels to plot on the boxes
random_test_sample_labels_to_plot = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                     for label_pred, score_pred in zip(random_test_sample_pred_labels, random_test_sample_pred_scores)]

print("[INFO] Labels with scores:")
for item in random_test_sample_labels_to_plot:
    print(item)

# Plot the predicted boxes on the test image
to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=test_image_pil),
        boxes=random_test_sample_pred_boxes,
        labels=random_test_sample_labels_to_plot,
        width=3
    )
)
# # Plot the random sample image with randomly predicted boxes (these will be very poor since the model is not trained on our data yet)
# to_pil_image(
# pic=draw_bounding_boxes(
# image=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
# boxes=random_test_sample_pred_boxes,
# labels=random_test_sample_labels_to_plot,
# width=3
# )
# )
[INFO] Labels with scores:
Pred: trash (0.7138)
Pred: bin (0.699)
Pred: hand (0.6244)
Pred: bin (0.6231)
Pred: not_trash (0.4189)
Pred: bin (0.2655)
Pred: hand (0.2617)
Pred: not_trash (0.2392)
Pred: not_trash (0.2335)
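Notice the model predicts several overlapping boxes for the same object (e.g. multiple bin boxes). One common fix, and the post-processing step the V3 demo uses, is Non-Maximum Suppression (NMS). Here's a minimal sketch with torchvision.ops.nms (class-agnostic for simplicity), reusing the prediction variables from the cell above:

from torchvision.ops import nms

pred_boxes = random_test_sample_pred_boxes.cpu()
pred_scores = random_test_sample_pred_scores.cpu()

# Keep the highest-scoring boxes and drop overlapping ones above an IoU threshold of 0.5
keep_indices = nms(boxes=pred_boxes, scores=pred_scores, iou_threshold=0.5)

nms_boxes = pred_boxes[keep_indices]
nms_labels = [random_test_sample_labels_to_plot[i] for i in keep_indices.tolist()]
print(f"[INFO] Boxes before NMS: {len(pred_boxes)} | after NMS: {len(nms_boxes)}")
print(nms_labels)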
16 TK - Upload our trained model to Hugging Face Hub
TK - Let’s make our model available for others to use.
# UPTOHERE
# Make extensions to make the model better... (e.g. data augmentation = harder training set = better overall validation loss)
# Model with data augmentation
# Model with longer training (e.g. 100 epochs)
# Research eval_do_concat_batches=False/True & see what the results do...
# Save the model
from datetime import datetime
# TODO: update this save path so we know when the model was saved and what its parameters were
training_epochs_ = training_args.num_train_epochs
learning_rate_ = "{:.0e}".format(training_args.learning_rate)

model_save_path = f"models/learn_hf_microsoft_detr_finetuned_trashify_box_dataset_only_manual_data_no_aug_{training_epochs_}_epochs_lr_{learning_rate_}"
print(f"[INFO] Saving model to: {model_save_path}")

model_v1_trainer.save_model(model_save_path)
[INFO] Saving model to: models/learn_hf_microsoft_detr_finetuned_trashify_box_dataset_only_manual_data_no_aug_25_epochs_lr_1e-04
# Push the model to the hub
# Note: this will require you to have your Hugging Face account setup
="upload trashify object detection model",
model_v1_trainer.push_to_hub(commit_message# token=None # Optional to add a token manually
)
CommitInfo(commit_url='https://huggingface.co/mrdbourke/detr_finetuned_trashify_box_detector/commit/ab273cec67e5124ac047dc1e068c379c718e6c37', commit_message='upload trashify object detection model', commit_description='', oid='ab273cec67e5124ac047dc1e068c379c718e6c37', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mrdbourke/detr_finetuned_trashify_box_detector', endpoint='https://huggingface.co', repo_type='model', repo_id='mrdbourke/detr_finetuned_trashify_box_detector'), pr_revision=None, pr_num=None)
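As a quick check that the upload worked, we could load the model straight back from the Hub with the object-detection pipeline and run it on one of our test images (a minimal sketch, assuming test_image_pil from the earlier cells is still available):

from transformers import pipeline

# Load the model we just pushed straight from the Hub
trashify_detector = pipeline(task="object-detection",
                             model="mrdbourke/detr_finetuned_trashify_box_detector")

# Run it on a test image (each prediction is a dict with a score, label and box)
hub_model_preds = trashify_detector(test_image_pil, threshold=0.25)
hub_model_preds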
17 Creating a demo of our model with Gradio
%%writefile demos/trashify_object_detector/README.md
---
title: Trashify Demo V1 🚮
emoji: 🗑️
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 4.40.0
app_file: app.py
pinned: false
license: apache-2.0
---
# 🚮 Trashify Object Detector V1
Object detection demo to detect `trash`, `bin`, `hand`, `trash_arm`, `not_trash`, `not_bin`, `not_hand`.
Used as an example for encouraging people to clean up their local area.

If `trash`, `hand` and `bin` are all detected = +1 point.

## Dataset

All Trashify models are trained on a custom hand-labelled dataset of people picking up trash and placing it in a bin.

The dataset can be found on Hugging Face as [`mrdbourke/trashify_manual_labelled_images`](https://huggingface.co/datasets/mrdbourke/trashify_manual_labelled_images).
## Demos
* [V1](https://huggingface.co/spaces/mrdbourke/trashify_demo_v1) = Fine-tuned DETR model trained *without* data augmentation.
* [V2](https://huggingface.co/spaces/mrdbourke/trashify_demo_v2) = Fine-tuned DETR model trained *with* data augmentation.
* [V3](https://huggingface.co/spaces/mrdbourke/trashify_demo_v3) = Fine-tuned DETR model trained *with* data augmentation (same as V2) with an NMS (Non Maximum Suppression) post-processing step.
TK - add links to resources to learn more
Overwriting demos/trashify_object_detector/README.md
%%writefile demos/trashify_object_detector/requirements.txt
timm
gradio
torch
transformers
Overwriting demos/trashify_object_detector/requirements.txt
%%writefile demos/trashify_object_detector/app.py
import gradio as gr
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoImageProcessor
from transformers import AutoModelForObjectDetection
# Note: Can load from Hugging Face or can load from local
= "mrdbourke/detr_finetuned_trashify_box_detector"
model_save_path
# Load the model and preprocessor
image_processor = AutoImageProcessor.from_pretrained(model_save_path)
model = AutoModelForObjectDetection.from_pretrained(model_save_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Get the id2label dictionary from the model
id2label = model.config.id2label
# Set up a colour dictionary for plotting boxes with different colours
color_dict = {
    "bin": "green",
"trash": "blue",
"hand": "purple",
"trash_arm": "yellow",
"not_trash": "red",
"not_bin": "red",
"not_hand": "red",
}
# Create helper functions for seeing if items from one list are in another
def any_in_list(list_a, list_b):
"Returns True if any item from list_a is in list_b, otherwise False."
return any(item in list_b for item in list_a)
def all_in_list(list_a, list_b):
"Returns True if all items from list_a are in list_b, otherwise False."
return all(item in list_b for item in list_a)
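# Example behaviour of the helpers above:
#   any_in_list(list_a=["trash", "bin", "hand"], list_b=["trash"]) -> True (at least one target item is present)
#   all_in_list(list_a=["trash", "bin", "hand"], list_b=["trash"]) -> False ("bin" and "hand" are missing)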
def predict_on_image(image, conf_threshold):
    with torch.no_grad():
        inputs = image_processor(images=[image], return_tensors="pt")
        outputs = model(**inputs.to(device))

        target_sizes = torch.tensor([[image.size[1], image.size[0]]]) # height, width

        results = image_processor.post_process_object_detection(outputs,
                                                                 threshold=conf_threshold,
                                                                 target_sizes=target_sizes)[0]

    # Return all items in results to CPU
    for key, value in results.items():
        try:
            results[key] = value.item().cpu() # can't get scalar as .item() so add try/except block
        except:
            results[key] = value.cpu()
# Can return results as plotted on a PIL image (then display the image)
    draw = ImageDraw.Draw(image)

    # Get a font from ImageFont
    font = ImageFont.load_default(size=20)

    # Get class names as text for print out
    class_name_text_labels = []
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
        # Create coordinates
        x, y, x2, y2 = tuple(box.tolist())

        # Get label_name
        label_name = id2label[label.item()]
        targ_color = color_dict[label_name]

        class_name_text_labels.append(label_name)

        # Draw the rectangle
        draw.rectangle(xy=(x, y, x2, y2),
                       outline=targ_color,
                       width=3)

        # Create a text string to display
        text_string_to_show = f"{label_name} ({round(score.item(), 3)})"

        # Draw the text on the image
        draw.text(xy=(x, y),
                  text=text_string_to_show,
                  fill="white",
                  font=font)
    # Remove the draw each time
    del draw

    # Setup blank string to print out
    return_string = ""

    # Setup list of target items to discover
    target_items = ["trash", "bin", "hand"]

    # If no items detected or trash, bin, hand not in list, return notification
    if (len(class_name_text_labels) == 0) or not (any_in_list(list_a=target_items, list_b=class_name_text_labels)):
        return_string = f"No trash, bin or hand detected at confidence threshold {conf_threshold}. Try another image or lowering the confidence threshold."
        return image, return_string

    # If there are some missing, print the ones which are missing
    elif not all_in_list(list_a=target_items, list_b=class_name_text_labels):
        missing_items = []
        for item in target_items:
            if item not in class_name_text_labels:
                missing_items.append(item)
        return_string = f"Detected the following items: {class_name_text_labels}. But missing the following in order to get +1: {missing_items}. If this is an error, try another image or altering the confidence threshold. Otherwise, the model may need to be updated with better data."

    # If all 3 of trash, bin, hand occur = +1
    if all_in_list(list_a=target_items, list_b=class_name_text_labels):
        return_string = f"+1! Found the following items: {class_name_text_labels}, thank you for cleaning up the area!"

    print(return_string)

    return image, return_string
# Create the interface
demo = gr.Interface(
    fn=predict_on_image,
    inputs=[
        gr.Image(type="pil", label="Target Image"),
        gr.Slider(minimum=0, maximum=1, value=0.25, label="Confidence Threshold")
    ],
    outputs=[
        gr.Image(type="pil", label="Image Output"),
        gr.Text(label="Text Output")
    ],
    title="🚮 Trashify Object Detection Demo V1",
    description="Help clean up your local area! Upload an image and get +1 if all of the following items are detected: trash, bin, hand.",
    # Examples come in the form of a list of lists, where each inner list contains elements to prefill the `inputs` parameter with
    examples=[
        ["examples/trashify_example_1.jpeg", 0.25],
        ["examples/trashify_example_2.jpeg", 0.25],
        ["examples/trashify_example_3.jpeg", 0.25],
    ],
    cache_examples=True
)

# Launch the demo
demo.launch()
Overwriting demos/trashify_object_detector/app.py
17.1 TK - Upload demo to Hugging Face Spaces to get it live
# 1. Import the required methods for uploading to the Hugging Face Hub
from huggingface_hub import (
create_repo,
    get_full_repo_name,
    upload_file, # for uploading a single file (if necessary)
    upload_folder # for uploading multiple files (in a folder)
)
# 2. Define the parameters we'd like to use for the upload
= "demos/trashify_object_detector" # TK - update this path
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "trashify_demo_v1"
HF_TARGET_SPACE_NAME = "space" # we're creating a Hugging Face Space
HF_REPO_TYPE = "gradio"
HF_SPACE_SDK = "" # optional: set to your Hugging Face token (but I'd advise storing this as an environment variable as previously discussed)
HF_TOKEN
# 3. Create a Space repository on Hugging Face Hub
print(f"[INFO] Creating repo on Hugging Face Hub with name: {HF_TARGET_SPACE_NAME}")
create_repo(
    repo_id=HF_TARGET_SPACE_NAME,
    # token=HF_TOKEN, # optional: set token manually (though it will be automatically recognized if it's available as an environment variable)
    repo_type=HF_REPO_TYPE,
    private=False, # set to True if you don't want your Space to be accessible to others
    space_sdk=HF_SPACE_SDK,
    exist_ok=True, # set to False if you want an error to raise if the repo_id already exists
)
# 4. Get the full repository name (e.g. {username}/{model_id} or {username}/{space_name})
full_hf_repo_name = get_full_repo_name(model_id=HF_TARGET_SPACE_NAME)
print(f"[INFO] Full Hugging Face Hub repo name: {full_hf_repo_name}")
# 5. Upload our demo folder
print(f"[INFO] Uploading {LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD} to repo: {full_hf_repo_name}")
folder_upload_url = upload_folder(
    repo_id=full_hf_repo_name,
    folder_path=LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
    path_in_repo=".", # upload our folder to the root directory ("." means "base" or "root", this is the default)
    # token=HF_TOKEN, # optional: set token manually
    repo_type=HF_REPO_TYPE,
    commit_message="Uploading Trashify box detection model app.py"
)
print(f"[INFO] Demo folder successfully uploaded with commit URL: {folder_upload_url}")
[INFO] Creating repo on Hugging Face Hub with name: trashify_demo_v1
[INFO] Full Hugging Face Hub repo name: mrdbourke/trashify_demo_v1
[INFO] Uploading demos/trashify_object_detector to repo: mrdbourke/trashify_demo_v1
[INFO] Demo folder successfully uploaded with commit URL: https://huggingface.co/spaces/mrdbourke/trashify_demo_v1/tree/main/.
TK - see the demo here: https://huggingface.co/spaces/mrdbourke/trashify_demo_v1
17.2 TK - Testing the hosted demo
from IPython.display import HTML
# You can get embeddable HTML code for your demo by clicking the "Embed" button on the demo page
HTML(data='''
<iframe
    src="https://mrdbourke-trashify-demo-v1.hf.space"
    frameborder="0"
    width="850"
    height="1000"
></iframe>
''')
18 TK - Improve our model with data augmentation
UPTOHERE
- Read up on object detection augmentation (keep it simple)
- Check out the papers for detection augmentation
- Train a model with data augmentation
- Compare the model’s metrics between data augmentation and no data augmentation (a rough sketch of this comparison is below)
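To preview where this is heading, once we've trained the data augmentation model (model_v2_trainer below), we could compare its validation loss against the earlier no-augmentation model (model_v1_trainer) directly. A rough sketch, assuming both trainers and the same validation split are available:

no_aug_results = model_v1_trainer.evaluate(eval_dataset=processed_dataset["validation"])
aug_results = model_v2_trainer.evaluate(eval_dataset=processed_dataset["validation"])

print(f"[INFO] Validation loss (no augmentation): {no_aug_results['eval_loss']:.4f}")
print(f"[INFO] Validation loss (with augmentation): {aug_results['eval_loss']:.4f}")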
18.1 Load dataset
from datasets import load_dataset
# load_dataset?
dataset = load_dataset(path="mrdbourke/trashify_manual_labelled_images")
print(f"[INFO] Length of original dataset: {len(dataset['train'])}")
# Split the data
= dataset["train"].train_test_split(test_size=0.3, seed=42) # split the dataset into 70/30 train/test
dataset_split = dataset_split["test"].train_test_split(test_size=0.6, seed=42) # split the test set into 40/60 validation/test
dataset_test_val_split
# Create splits
"train"] = dataset_split["train"]
dataset["validation"] = dataset_test_val_split["train"]
dataset["test"] = dataset_test_val_split["test"]
dataset[
dataset
[INFO] Length of original dataset: 1128
DatasetDict({
train: Dataset({
features: ['image', 'image_id', 'annotations', 'label_source', 'image_source'],
num_rows: 789
})
validation: Dataset({
features: ['image', 'image_id', 'annotations', 'label_source', 'image_source'],
num_rows: 135
})
test: Dataset({
features: ['image', 'image_id', 'annotations', 'label_source', 'image_source'],
num_rows: 204
})
})
# Get the categories from the dataset
# Note: this requires the dataset to have been uploaded with this feature setup
= dataset["train"].features["annotations"].feature["category_id"]
categories
# Get the names attribute
categories.names
['bin', 'hand', 'not_bin', 'not_hand', 'not_trash', 'trash', 'trash_arm']
id2label = {i: class_name for i, class_name in enumerate(categories.names)}
label2id = {value: key for key, value in id2label.items()}
id2label, label2id
({0: 'bin',
1: 'hand',
2: 'not_bin',
3: 'not_hand',
4: 'not_trash',
5: 'trash',
6: 'trash_arm'},
{'bin': 0,
'hand': 1,
'not_bin': 2,
'not_hand': 3,
'not_trash': 4,
'trash': 5,
'trash_arm': 6})
# View a random sample
import random
= random.randint(0, len(dataset["train"]))
random_idx = dataset["train"][random_idx]
random_sample random_sample
{'image': <PIL.Image.Image image mode=RGB size=960x1280>,
'image_id': 955,
'annotations': {'file_name': ['ed8cb1ab-2882-4ab7-a839-c53fa2908a72.jpeg',
'ed8cb1ab-2882-4ab7-a839-c53fa2908a72.jpeg',
'ed8cb1ab-2882-4ab7-a839-c53fa2908a72.jpeg',
'ed8cb1ab-2882-4ab7-a839-c53fa2908a72.jpeg'],
'image_id': [955, 955, 955, 955],
'category_id': [5, 1, 0, 4],
'bbox': [[464.79998779296875, 625.5999755859375, 68.30000305175781, 92.5],
[483.0, 686.2000122070312, 173.0, 247.3000030517578],
[102.80000305175781, 361.70001220703125, 813.5, 734.0],
[325.29998779296875,
716.5999755859375,
189.60000610351562,
215.3000030517578]],
'iscrowd': [0, 0, 0, 0],
'area': [6317.75, 42782.8984375, 597109.0, 40820.87890625]},
'label_source': 'manual_prodigy_label',
'image_source': 'manual_taken_photo'}
18.2 Setup model
from transformers import AutoModelForObjectDetection, AutoImageProcessor
# Model config - https://huggingface.co/docs/transformers/main/en/model_doc/conditional_detr#transformers.ConditionalDetrConfig
# Model docs - https://huggingface.co/docs/transformers/main/en/model_doc/conditional_detr#transformers.ConditionalDetrModel
= "microsoft/conditional-detr-resnet-50"
MODEL_NAME
# Set image size
= 640 # other common image sizes include: 300x300, 480x480, 512x512, 640x640, 800x800 (best to experiment and see which works best)
IMAGE_SIZE
# Get the image processor (this is required for prepraring images)
# See docs: https://huggingface.co/docs/transformers/main/en/model_doc/conditional_detr#transformers.ConditionalDetrImageProcessor.preprocess
= AutoImageProcessor.from_pretrained(
image_processor =MODEL_NAME,
pretrained_model_name_or_pathformat="coco_detection", # this is the default
=True, # defaults to True, converts boxes to (center_x, center_y, width, height)
do_convert_annotations={"shortest_edge": IMAGE_SIZE, "longest_edge": IMAGE_SIZE},
size=None # Note: this parameter is deprecated and will produce a warning if used during processing.
max_size
)
# Check out the image processor
image_processor
ConditionalDetrImageProcessor {
"do_convert_annotations": true,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"format": "coco_detection",
"image_mean": [
0.485,
0.456,
0.406
],
"image_processor_type": "ConditionalDetrImageProcessor",
"image_std": [
0.229,
0.224,
0.225
],
"pad_size": null,
"resample": 2,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 640,
"shortest_edge": 640
}
}
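To see what this configuration does in practice, we can run a single image through the processor and check the output shape (a quick sketch; the exact shape depends on the input image's aspect ratio):

example_processed = image_processor.preprocess(images=dataset["train"][0]["image"],
                                               return_tensors="pt")

# For a portrait phone photo this prints something like torch.Size([1, 3, 640, 480]):
# the longest edge is resized to 640 and the pixel values are rescaled and normalized
print(example_processed["pixel_values"].shape)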
# First create a couple of dataclasses to store our data format
from dataclasses import dataclass, asdict
from typing import List, Tuple
@dataclass
class SingleCOCOAnnotation:
"An instance of a single COCO annotation. See COCO format: https://cocodataset.org/#format-data"
    image_id: int
    category_id: int
    bbox: List[float] # bboxes in format [x_top_left, y_top_left, width, height]
    area: float = 0.0
    iscrowd: int = 0
@dataclass
class ImageCOCOAnnotations:
"A collection of COCO annotations for a given image_id."
    image_id: int
annotations: List[SingleCOCOAnnotation]
def format_image_annotations_as_coco(
    image_id: int,
    categories: List[int],
    areas: List[float],
    bboxes: List[Tuple[float, float, float, float]] # bboxes in XYWH format
) -> dict:
    # Turn input lists into a list of dicts
    coco_format_annotations = [
        asdict(SingleCOCOAnnotation(
            image_id=image_id,
            category_id=category,
            bbox=list(bbox),
            area=area,
        ))
        for category, area, bbox in zip(categories, areas, bboxes)
    ]

    # Return dictionary of annotations with format {"image_id": ..., "annotations": ...}
    return asdict(ImageCOCOAnnotations(image_id=image_id,
                                       annotations=coco_format_annotations))
# Let's try it out
image_id = 0
random_idx = random.randint(0, len(dataset["train"]) - 1)
random_sample = dataset["train"][random_idx]
random_sample_categories = random_sample["annotations"]["category_id"]
random_sample_areas = random_sample["annotations"]["area"]
random_sample_bboxes = random_sample["annotations"]["bbox"]

random_sample_coco_annotations = format_image_annotations_as_coco(image_id=image_id,
                                                                  categories=random_sample_categories,
                                                                  areas=random_sample_areas,
                                                                  bboxes=random_sample_bboxes)
random_sample_coco_annotations
{'image_id': 0,
'annotations': [{'image_id': 0,
'category_id': 0,
'bbox': [452.79998779296875,
446.6000061035156,
272.70001220703125,
388.20001220703125],
'area': 105862.140625,
'iscrowd': 0},
{'image_id': 0,
'category_id': 0,
'bbox': [146.5, 487.5, 348.3999938964844, 424.79998779296875],
'area': 148000.3125,
'iscrowd': 0},
{'image_id': 0,
'category_id': 0,
'bbox': [8.300000190734863, 522.5, 241.3000030517578, 505.0],
'area': 121856.5,
'iscrowd': 0}]}
# Setup the model
# TODO: Can functionize this to create a base model (e.g. a model with all the base settings/untrained weights)
def create_model():
    model = AutoModelForObjectDetection.from_pretrained(
        pretrained_model_name_or_path=MODEL_NAME,
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,
        backbone="resnet50")
    return model

model_aug = create_model()
model_aug
Some weights of ConditionalDetrForObjectDetection were not initialized from the model checkpoint at microsoft/conditional-detr-resnet-50 and are newly initialized because the shapes did not match:
- class_labels_classifier.bias: found shape torch.Size([91]) in the checkpoint and torch.Size([7]) in the model instantiated
- class_labels_classifier.weight: found shape torch.Size([91, 256]) in the checkpoint and torch.Size([7, 256]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
ConditionalDetrForObjectDetection(
(model): ConditionalDetrModel(
(backbone): ConditionalDetrConvModel(
(conv_encoder): ConditionalDetrConvEncoder(
(model): FeatureListNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
)
)
(position_embedding): ConditionalDetrSinePositionEmbedding()
)
(input_projection): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(query_position_embeddings): Embedding(300, 256)
(encoder): ConditionalDetrEncoder(
(layers): ModuleList(
(0-5): 6 x ConditionalDetrEncoderLayer(
(self_attn): DetrAttention(
(k_proj): Linear(in_features=256, out_features=256, bias=True)
(v_proj): Linear(in_features=256, out_features=256, bias=True)
(q_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(activation_fn): ReLU()
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): ConditionalDetrDecoder(
(layers): ModuleList(
(0): ConditionalDetrDecoderLayer(
(sa_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_v_proj): Linear(in_features=256, out_features=256, bias=True)
(self_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(activation_fn): ReLU()
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_v_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_sine_proj): Linear(in_features=256, out_features=256, bias=True)
(encoder_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1-5): 5 x ConditionalDetrDecoderLayer(
(sa_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_v_proj): Linear(in_features=256, out_features=256, bias=True)
(self_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(activation_fn): ReLU()
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_proj): None
(ca_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_v_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_sine_proj): Linear(in_features=256, out_features=256, bias=True)
(encoder_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(layernorm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(query_scale): MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
)
)
(ref_point_head): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=2, bias=True)
)
)
)
)
(class_labels_classifier): Linear(in_features=256, out_features=7, bias=True)
(bbox_predictor): ConditionalDetrMLPPredictionHead(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
18.3 TK - Setup and visualize transforms (augmentations)
- TK - explain simple augmentations:
    - RandomHorizontalFlip
    - ColorJitter
    - That’s it…
- Tailor the data augmentations to your own dataset/problem (a minimal standalone sketch of the two augmentations follows below)
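Before wiring these into the full preprocessing pipeline below, here's a minimal standalone sketch that applies just the two augmentations to a single training image so you can eyeball their effect in isolation (p=1.0 is used here so the flip always happens):

from torchvision.transforms import v2

simple_augmentations = v2.Compose([
    v2.RandomHorizontalFlip(p=1.0), # always flip so the effect is visible
    v2.ColorJitter(brightness=0.75, contrast=0.75) # randomly adjust brightness/contrast
])

sample_image = dataset["train"][42]["image"]
augmented_image = simple_augmentations(sample_image)
augmented_image # compare against sample_image in a notebook cell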
import torch
import torchvision
from torchvision.transforms import v2
from torchvision.transforms.v2.functional import to_pil_image, pil_to_tensor, pad
from torchvision.utils import draw_bounding_boxes
# Optional transform from here: https://arxiv.org/pdf/2012.07177
# Scale jitter -> pad -> resize
train_transforms = v2.Compose([
    v2.ToImage(),
    # v2.RandomResizedCrop(size=(640, 640), antialias=True),
    # v2.Resize(size=(640, 640)),
    # v2.RandomShortestSize(min_size=480, max_size=640),
    # v2.ScaleJitter(target_size=(640, 640)),
    # PadToSize(target_height=640, target_width=640),
    v2.RandomHorizontalFlip(p=0.5),
    # v2.RandomPhotometricDistort(p=0.75),
    # v2.RandomShortestSize(min_size=480, max_size=640),
    # v2.Resize(size=(640, 640)),
    v2.ColorJitter(brightness=0.75, # randomly adjust the brightness
                   contrast=0.75), # randomly alter the contrast
    # v2.RandomPerspective(distortion_scale=0.3,
    #                      p=0.3,
    #                      fill=(123, 117, 104)), # fill with average colour
    # v2.RandomZoomOut(side_range=(1.0, 1.5),
    #                  fill=(123, 117, 104)),
    v2.ToDtype(dtype=torch.float32, scale=True),
    # v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

    # Sanitize boxes, recommended to be called at least once at the end of the transform pipeline
    # https://pytorch.org/vision/stable/generated/torchvision.transforms.v2.SanitizeBoundingBoxes.html#torchvision.transforms.v2.SanitizeBoundingBoxes
    v2.SanitizeBoundingBoxes(labels_getter=None)
])
18.4 TK - Visualize transforms
import random
= random.randint(0, len(dataset["train"]))
random_idx = dataset["train"][random_idx]
random_sample
# Perform transform on image
= random_sample["image"]
random_sample_image = random_sample["image"].size
random_sample_image_width, random_sample_image_height = random_sample["annotations"]["bbox"] # these are in XYWH format
random_sample_boxes_xywh = torchvision.ops.box_convert(boxes=torch.tensor(random_sample_boxes_xywh),
random_sample_boxes_xyxy ="xywh",
in_fmt="xyxy")
out_fmt
# Format boxes to be xyxy for transforms
= torchvision.tv_tensors.BoundingBoxes(
random_sample_boxes_xyxy =random_sample_boxes_xyxy,
dataformat="XYXY",
=(random_sample_image_height, random_sample_image_width) # comes in the form height, width
canvas_size
)
= train_transforms(random_sample_image,
random_sample_image_transformed, random_sample_boxes_transformed random_sample_boxes_xyxy)
= to_pil_image(pic=draw_bounding_boxes(
random_sample_original_image_with_boxes =pil_to_tensor(pic=random_sample_image),
image=random_sample_boxes_xyxy,
boxes=None,
labels=3))
width= (random_sample_original_image_with_boxes.size[1], random_sample_original_image_with_boxes.size[0])
random_sample_original_image_with_boxes_size
# Plot the predicted boxes on the random test image
= to_pil_image(pic=draw_bounding_boxes(
random_sample_transformed_image_with_boxes =random_sample_image_transformed,
image=random_sample_boxes_transformed,
boxes=None,
labels=3))
width= (random_sample_transformed_image_with_boxes.size[1], random_sample_transformed_image_with_boxes.size[0])
random_sample_transformed_image_with_boxes_size
# Visualize the transformed image
import matplotlib.pyplot as plt
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Display image 1
axes[0].imshow(random_sample_original_image_with_boxes)
axes[0].axis("off") # Hide axes
axes[0].set_title(f"Original Image | Size: {random_sample_original_image_with_boxes_size} (hxw)")

# Display image 2
axes[1].imshow(random_sample_transformed_image_with_boxes)
axes[1].axis("off") # Hide axes
axes[1].set_title(f"Transformed Image | Size: {random_sample_transformed_image_with_boxes_size} (hxw)")

# Show the plot
plt.tight_layout()
plt.show()
18.5 TK - Create function to preprocess and transform batch of examples
from torchvision import tv_tensors
def preprocess_and_transform_batch(examples,
                                   image_processor,
                                   transforms=None # Note: Could optionally add transforms (e.g. data augmentation) here
                                   ):
    """
    Function to preprocess batches of data.

    Can optionally apply a transform (e.g. data augmentation) to each sample.
    """
    images = []
    coco_annotations = []

    for image, image_id, annotations_dict in zip(examples["image"], examples["image_id"], examples["annotations"]):
        # Note: may need to open image if it is an image path rather than PIL.Image
        bbox_list = annotations_dict["bbox"]
        category_list = annotations_dict["category_id"]
        area_list = annotations_dict["area"]

        # Note: Could optionally apply a transform here.
        if transforms:
            width, height = image.size[0], image.size[1]
            bbox_list = tv_tensors.BoundingBoxes(data=torch.tensor(bbox_list),
                                                 format="XYWH",
                                                 canvas_size=(height, width)) # canvas_size = height, width
            image, bbox_list = transforms(image, bbox_list)

        # Format the annotations into COCO format
        coco_format_annotations = format_image_annotations_as_coco(image_id=image_id,
                                                                   categories=category_list,
                                                                   areas=area_list,
                                                                   bboxes=bbox_list)

        # Add images/annotations to their respective lists
        images.append(image)
        coco_annotations.append(coco_format_annotations)

    # Apply the image processor to lists of images and annotations
    preprocessed_batch = image_processor.preprocess(images=images,
                                                    annotations=coco_annotations,
                                                    return_tensors="pt",
                                                    do_rescale=False if transforms else True, # transformed images are already scaled to [0, 1]
                                                    do_resize=True,
                                                    do_pad=True)

    return preprocessed_batch
from functools import partial
# Make a transform for different splits
train_transform_batch = partial(
    preprocess_and_transform_batch,
    transforms=train_transforms,
    image_processor=image_processor
)

validation_transform_batch = partial(
    preprocess_and_transform_batch,
    transforms=None,
    image_processor=image_processor
)
processed_dataset = dataset.copy()
processed_dataset["train"] = dataset["train"].with_transform(train_transform_batch)
processed_dataset["validation"] = dataset["validation"].with_transform(validation_transform_batch)
processed_dataset["test"] = dataset["test"].with_transform(validation_transform_batch)
# Create data_collate_function to collect samples into batches
# TK - want to get a dictionary of {"pixel_values": [batch_of_samples], "labels": [batch_of_samples], "pixel_mask": [batch_of_samples]}
def data_collate_function(batch):
    collated_data = {}

    # Stack together a collection of pixel_values tensors
    collated_data["pixel_values"] = torch.stack([sample["pixel_values"] for sample in batch])

    # Get the labels (these are dictionaries so no need to use torch.stack)
    collated_data["labels"] = [sample["labels"] for sample in batch]

    # If there is a pixel_mask key, return the pixel_mask's as well
    if "pixel_mask" in batch[0]:
        collated_data["pixel_mask"] = torch.stack([sample["pixel_mask"] for sample in batch])

    return collated_data
model_aug = create_model()
model_aug
Some weights of ConditionalDetrForObjectDetection were not initialized from the model checkpoint at microsoft/conditional-detr-resnet-50 and are newly initialized because the shapes did not match:
- class_labels_classifier.bias: found shape torch.Size([91]) in the checkpoint and torch.Size([7]) in the model instantiated
- class_labels_classifier.weight: found shape torch.Size([91, 256]) in the checkpoint and torch.Size([7, 256]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
ConditionalDetrForObjectDetection(
(model): ConditionalDetrModel(
(backbone): ConditionalDetrConvModel(
(conv_encoder): ConditionalDetrConvEncoder(
(model): FeatureListNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer2): Sequential(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer3): Sequential(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
(layer4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): ConditionalDetrFrozenBatchNorm2d()
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): ConditionalDetrFrozenBatchNorm2d()
(act1): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): ConditionalDetrFrozenBatchNorm2d()
(drop_block): Identity()
(act2): ReLU(inplace=True)
(aa): Identity()
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): ConditionalDetrFrozenBatchNorm2d()
(act3): ReLU(inplace=True)
)
)
)
)
(position_embedding): ConditionalDetrSinePositionEmbedding()
)
(input_projection): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(query_position_embeddings): Embedding(300, 256)
(encoder): ConditionalDetrEncoder(
(layers): ModuleList(
(0-5): 6 x ConditionalDetrEncoderLayer(
(self_attn): DetrAttention(
(k_proj): Linear(in_features=256, out_features=256, bias=True)
(v_proj): Linear(in_features=256, out_features=256, bias=True)
(q_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(activation_fn): ReLU()
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): ConditionalDetrDecoder(
(layers): ModuleList(
(0): ConditionalDetrDecoderLayer(
(sa_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_v_proj): Linear(in_features=256, out_features=256, bias=True)
(self_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(activation_fn): ReLU()
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_v_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_sine_proj): Linear(in_features=256, out_features=256, bias=True)
(encoder_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1-5): 5 x ConditionalDetrDecoderLayer(
(sa_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_qpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(sa_v_proj): Linear(in_features=256, out_features=256, bias=True)
(self_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(activation_fn): ReLU()
(self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_qcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_proj): None
(ca_kcontent_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_kpos_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_v_proj): Linear(in_features=256, out_features=256, bias=True)
(ca_qpos_sine_proj): Linear(in_features=256, out_features=256, bias=True)
(encoder_attn): ConditionalDetrAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=256, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=256, bias=True)
(final_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(layernorm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(query_scale): MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
)
)
(ref_point_head): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=2, bias=True)
)
)
)
)
(class_labels_classifier): Linear(in_features=256, out_features=7, bias=True)
(bbox_predictor): ConditionalDetrMLPPredictionHead(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
# Note: Depending on the size/speed of your GPU, this may take a while
from transformers import TrainingArguments, Trainer

# Set the batch size according to the memory you have available on your GPU
# e.g. on my NVIDIA RTX 4090 with 24GB of VRAM, I can use a batch size of 32 without running out of memory
BATCH_SIZE = 16

# Disable warnings about `max_size` parameter being deprecated (this is okay)
import warnings
warnings.filterwarnings("ignore", message="The `max_size` parameter is deprecated*")

# Note: AdamW Optimizer is used by default
training_args = TrainingArguments(
    output_dir="detr_finetuned_trashify_box_detector_with_data_aug", # TK - make sure this is suitable for data aug model
    num_train_epochs=25,
    fp16=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=1e-4,
    lr_scheduler_type="linear", # default = "linear", can try others such as "cosine", "constant" etc
    weight_decay=1e-4,
    max_grad_norm=0.01,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=2,
    remove_unused_columns=False,
    report_to="none", # don't save experiments to a third party service
    dataloader_num_workers=4,
    warmup_ratio=0.05,
    push_to_hub=False,
    eval_do_concat_batches=False
)

model_v2_trainer = Trainer(
    model=model_aug,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"],
    tokenizer=image_processor,
    data_collator=data_collate_function,
    # compute_metrics=None # TODO: add a metrics function, just see if model trains first
)

model_v2_results = model_v2_trainer.train()
/home/daniel/miniconda3/envs/ai/lib/python3.11/site-packages/accelerate/accelerator.py:488: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 100.473500 | 8.029722 |
2 | 4.369000 | 2.737582 |
3 | 2.551800 | 2.183892 |
4 | 2.222600 | 1.922801 |
5 | 1.990600 | 1.740759 |
6 | 1.821900 | 1.557272 |
7 | 1.697400 | 1.477890 |
8 | 1.602700 | 1.451024 |
9 | 1.551700 | 1.371128 |
10 | 1.449100 | 1.317680 |
11 | 1.433500 | 1.281066 |
12 | 1.364500 | 1.247493 |
13 | 1.331400 | 1.206003 |
14 | 1.297300 | 1.187397 |
15 | 1.250600 | 1.179421 |
16 | 1.231900 | 1.165661 |
17 | 1.147900 | 1.129974 |
18 | 1.146600 | 1.117911 |
19 | 1.113800 | 1.109535 |
20 | 1.115300 | 1.096120 |
21 | 1.089400 | 1.078995 |
22 | 1.069100 | 1.087004 |
23 | 1.061900 | 1.080366 |
24 | 1.045900 | 1.071728 |
25 | 1.036300 | 1.070385 |
TK - Note: You might get the following error (negative bounding box coordinate predictions) during training/evaluation. Training is inherently stochastic, so you can try re-running training for more stable predictions, or lengthen the learning rate warmup to help stabilize the early updates:
ValueError: boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[ 0.5796, 0.5566, 0.9956, 0.9492], [ 0.5718, 0.0610, 0.7202, 0.1738], [ 0.8218, 0.5107, 0.9878, 0.6289], …, [ 0.1379, 0.1403, 0.6709, 0.6138], [ 0.7471, 0.4319, 1.0088, 0.5864], [-0.0660, 0.2052, 0.2067, 0.5107]], device='cuda:0', dtype=torch.float16)
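If this error appears, one option is to lengthen the learning rate warmup so the earliest (most unstable) updates are smaller. The following is a minimal sketch reusing the `TrainingArguments` values from above with only the warmup changed; the 0.1 value is an illustrative assumption, not a tuned setting.

from transformers import TrainingArguments

# Same setup as above, with a longer warmup (illustrative value, not a tuned setting)
training_args_longer_warmup = TrainingArguments(
    output_dir="detr_finetuned_trashify_box_detector_with_data_aug",
    num_train_epochs=25,
    learning_rate=1e-4,
    warmup_ratio=0.1,    # warm up over the first 10% of training steps instead of 5%
    max_grad_norm=0.01,  # gradient clipping also helps keep early box predictions in range
    fp16=True,
)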
18.6 TK - Save the trained model
# Save the model
from datetime import datetime

# TODO: update this save path so we know when the model was saved and what its parameters were
training_epochs_ = training_args.num_train_epochs
learning_rate_ = "{:.0e}".format(training_args.learning_rate)

model_v2_save_path = f"models/learn_hf_microsoft_detr_finetuned_trashify_box_dataset_only_manual_data_with_aug_{training_epochs_}_epochs_lr_{learning_rate_}"
print(f"[INFO] Saving model to: {model_v2_save_path}")
model_v2_trainer.save_model(model_v2_save_path)
[INFO] Saving model to: models/learn_hf_microsoft_detr_finetuned_trashify_box_dataset_only_manual_data_with_aug_25_epochs_lr_1e-04
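As a quick check (a minimal sketch, not part of the original notebook), the saved files can be loaded straight back from the local directory. The processor config should be saved alongside the model weights because we passed `image_processor` to the `Trainer`.

from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Load the fine-tuned model (and its processor) back from the local save path above
reloaded_image_processor = AutoImageProcessor.from_pretrained(model_v2_save_path)
reloaded_model = AutoModelForObjectDetection.from_pretrained(model_v2_save_path)

print(f"[INFO] Reloaded model with {sum(p.numel() for p in reloaded_model.parameters()):,} parameters")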
19 TK - Upload Augmentation Model to Hugging Face Hub
# Push the model to the Hugging Face Hub
# TK Note: This will require you to have your Hugging Face account setup (e.g. see the setup guide, tk - link to setup guide)
# TK - this will push to the parameter `output_dir="detr_finetuned_trashify_box_detector_with_data_aug"`
model_v2_trainer.push_to_hub(commit_message="upload trashify object detection model with data augmentation",
                             # token=None, # Optional to add token manually
)
CommitInfo(commit_url='https://huggingface.co/mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug/commit/2f5f3ed0a205b13ddf2a0e3b76120412e33b0861', commit_message='upload trashify object detection model with data augmentation', commit_description='', oid='2f5f3ed0a205b13ddf2a0e3b76120412e33b0861', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug', endpoint='https://huggingface.co', repo_type='model', repo_id='mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug'), pr_revision=None, pr_num=None)
20 TK - Compare results of different models
UPTOHERE - Showcase model 2 doing better because of augmentation (harder to learn)
- TK - Compare the V1 model to the V2 model
- TK - Get model_v1 results into a variable and save them for later
- Compare both models as plots against each other, e.g. have the training curves for aug/no_aug on one plot and the validation curves for aug/no_aug on another plot
- TK - Offer extensions to improve the model
- TK - Train the model for longer, potentially using synthetic data…?
- TK - Could 1000 high quality synthetic data samples improve our model?
- TK - Try using a different learning rate
# TK - Turn this workflow into a function e.g. def get_history_from_trainer() -> df/dict of history
def get_history_metrics_from_trainer(trainer):
    trainer_history = trainer.state.log_history
    trainer_history_metrics = trainer_history[:-1] # get everything except the training time metrics (we've seen these already)
    trainer_history_training_time = trainer_history[-1]

    model_train_loss = [item["loss"] for item in trainer_history_metrics if "loss" in item.keys()]
    model_eval_loss = [item["eval_loss"] for item in trainer_history_metrics if "eval_loss" in item.keys()]
    model_learning_rate = [item["learning_rate"] for item in trainer_history_metrics if "learning_rate" in item.keys()]

    return model_train_loss, model_eval_loss, model_learning_rate, trainer_history_training_time

model_v1_train_loss, model_v1_eval_loss, model_v1_learning_rate, _ = get_history_metrics_from_trainer(trainer=model_v1_trainer)
model_v2_train_loss, model_v2_eval_loss, model_v2_learning_rate, _ = get_history_metrics_from_trainer(trainer=model_v2_trainer)
import matplotlib.pyplot as plt

# Plot model loss curves against each other for the same model
# Note: Start from index 1 onwards to remove the large loss spike at the beginning of training
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
ax[0].plot(model_v1_train_loss[1:], label="Model V1 Train Loss")
ax[0].plot(model_v1_eval_loss[1:], label="Model V1 Eval Loss")
ax[0].set_title("Model V1 Loss Curves")
ax[0].set_ylabel("Loss")
ax[0].set_xlabel("Epoch")
ax[0].legend()

ax[1].plot(model_v2_train_loss[1:], label="Model V2 Train Loss")
ax[1].plot(model_v2_eval_loss[1:], label="Model V2 Eval Loss")
ax[1].set_title("Model V2 Loss Curves")
ax[1].set_ylabel("Loss")
ax[1].set_xlabel("Epoch")
ax[1].legend();
TK - Notice how model V1 (no data augmentation) begins to overfit, while model V2 overfits less and achieves a lower validation loss.
import matplotlib.pyplot as plt

plt.plot(model_v1_learning_rate, label="Model V1")
plt.plot(model_v2_learning_rate, label="Model V2")
plt.title("Model Learning Rate vs. Epoch")
plt.ylabel("Learning Rate")
plt.xlabel("Epoch")
plt.legend();
# Plot loss values against each other
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
num_epochs = range(0, len(model_v1_train_loss))
ax[0].plot(model_v1_train_loss[1:], label="Model 1 Training Loss")
ax[0].plot(model_v2_train_loss[1:], label="Model 2 Training Loss")
ax[0].set_title("Model Training Loss Curves")
ax[0].set_ylabel("Training Loss")
ax[0].set_xlabel("Epochs")
ax[0].legend()

ax[1].plot(model_v1_eval_loss[1:], label="Model 1 Eval Loss")
ax[1].plot(model_v2_eval_loss[1:], label="Model 2 Eval Loss")
ax[1].set_title("Model Eval Loss Curves")
ax[1].set_ylabel("Eval Loss")
ax[1].set_xlabel("Epochs")
ax[1].legend();
TK - Describe the loss curves here: model 2's training loss may be higher, but its evaluation loss keeps improving towards the end of training.
21 TK - Create demo with Augmentation Model
# Make directory for demo
from pathlib import Path

trashify_data_aug_model_dir = Path("demos/trashify_object_detector_data_aug_model/")
trashify_data_aug_model_dir.mkdir(exist_ok=True)
%%writefile demos/trashify_object_detector_data_aug_model/README.md
---
title: Trashify Demo V2 🚮
emoji: 🗑️
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 4.40.0
app_file: app.py
pinned: false
license: apache-2.0
---

# 🚮 Trashify Object Detector Demo V2

Object detection demo to detect `trash`, `bin`, `hand`, `trash_arm`, `not_trash`, `not_bin`, `not_hand`.

Used as an example to encourage people to clean up their local area.

If `trash`, `hand`, `bin` all detected = +1 point.

## Dataset

All Trashify models are trained on a custom hand-labelled dataset of people picking up trash and placing it in a bin.

The dataset can be found on Hugging Face as [`mrdbourke/trashify_manual_labelled_images`](https://huggingface.co/datasets/mrdbourke/trashify_manual_labelled_images).

## Demos

* [V1](https://huggingface.co/spaces/mrdbourke/trashify_demo_v1) = Fine-tuned DETR model trained *without* data augmentation.
* [V2](https://huggingface.co/spaces/mrdbourke/trashify_demo_v2) = Fine-tuned DETR model trained *with* data augmentation.
* [V3](https://huggingface.co/spaces/mrdbourke/trashify_demo_v3) = Fine-tuned DETR model trained *with* data augmentation (same as V2) with an NMS (Non Maximum Suppression) post-processing step.

TK - finish the README.md + update with links to materials
Overwriting demos/trashify_object_detector_data_aug_model/README.md
%%writefile demos/trashify_object_detector_data_aug_model/requirements.txt
gradio
timm
torch
transformers
Overwriting demos/trashify_object_detector_data_aug_model/requirements.txt
%%writefile demos/trashify_object_detector_data_aug_model/app.py
import gradio as gr
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoImageProcessor
from transformers import AutoModelForObjectDetection

# Note: Can load from Hugging Face or can load from local.
# You will have to replace {mrdbourke} for your own username if the model is on your Hugging Face account.
model_save_path = "mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug"

# Load the model and preprocessor
image_processor = AutoImageProcessor.from_pretrained(model_save_path)
model = AutoModelForObjectDetection.from_pretrained(model_save_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Get the id2label dictionary from the model
id2label = model.config.id2label

# Set up a colour dictionary for plotting boxes with different colours
color_dict = {
    "bin": "green",
    "trash": "blue",
    "hand": "purple",
    "trash_arm": "yellow",
    "not_trash": "red",
    "not_bin": "red",
    "not_hand": "red",
}

# Create helper functions for seeing if items from one list are in another
def any_in_list(list_a, list_b):
    "Returns True if any item from list_a is in list_b, otherwise False."
    return any(item in list_b for item in list_a)

def all_in_list(list_a, list_b):
    "Returns True if all items from list_a are in list_b, otherwise False."
    return all(item in list_b for item in list_a)

def predict_on_image(image, conf_threshold):
    with torch.no_grad():
        inputs = image_processor(images=[image], return_tensors="pt")
        outputs = model(**inputs.to(device))

        target_sizes = torch.tensor([[image.size[1], image.size[0]]]) # height, width

        results = image_processor.post_process_object_detection(outputs,
                                                                 threshold=conf_threshold,
                                                                 target_sizes=target_sizes)[0]

    # Return all items in results to CPU
    for key, value in results.items():
        try:
            results[key] = value.item().cpu() # can't get scalar as .item() so add try/except block
        except:
            results[key] = value.cpu()

    # Can return results as plotted on a PIL image (then display the image)
    draw = ImageDraw.Draw(image)

    # Get a font from ImageFont
    font = ImageFont.load_default(size=20)

    # Get class names as text for print out
    class_name_text_labels = []

    for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
        # Create coordinates
        x, y, x2, y2 = tuple(box.tolist())

        # Get label_name
        label_name = id2label[label.item()]
        targ_color = color_dict[label_name]

        class_name_text_labels.append(label_name)

        # Draw the rectangle
        draw.rectangle(xy=(x, y, x2, y2),
                       outline=targ_color,
                       width=3)

        # Create a text string to display
        text_string_to_show = f"{label_name} ({round(score.item(), 3)})"

        # Draw the text on the image
        draw.text(xy=(x, y),
                  text=text_string_to_show,
                  fill="white",
                  font=font)

    # Remove the draw each time
    del draw

    # Setup blank string to print out
    return_string = ""

    # Setup list of target items to discover
    target_items = ["trash", "bin", "hand"]

    # If no items detected or trash, bin, hand not in list, return notification
    if (len(class_name_text_labels) == 0) or not (any_in_list(list_a=target_items, list_b=class_name_text_labels)):
        return_string = f"No trash, bin or hand detected at confidence threshold {conf_threshold}. Try another image or lowering the confidence threshold."
        return image, return_string

    # If there are some missing, print the ones which are missing
    elif not all_in_list(list_a=target_items, list_b=class_name_text_labels):
        missing_items = []
        for item in target_items:
            if item not in class_name_text_labels:
                missing_items.append(item)
        return_string = f"Detected the following items: {class_name_text_labels}. But missing the following in order to get +1: {missing_items}. If this is an error, try another image or altering the confidence threshold. Otherwise, the model may need to be updated with better data."

    # If all 3 trash, bin, hand occur = +1
    if all_in_list(list_a=target_items, list_b=class_name_text_labels):
        return_string = f"+1! Found the following items: {class_name_text_labels}, thank you for cleaning up the area!"

    print(return_string)

    return image, return_string

# Create the interface
demo = gr.Interface(
    fn=predict_on_image,
    inputs=[
        gr.Image(type="pil", label="Target Image"),
        gr.Slider(minimum=0, maximum=1, value=0.25, label="Confidence Threshold")
    ],
    outputs=[
        gr.Image(type="pil", label="Image Output"),
        gr.Text(label="Text Output")
    ],
    title="🚮 Trashify Object Detection Demo V2",
    description="""Help clean up your local area! Upload an image and get +1 if there is all of the following items detected: trash, bin, hand.

    The [model](https://huggingface.co/mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug) in V2 has been trained with data augmentation preprocessing (color jitter, horizontal flipping) to improve robustness.
    """,
    # Examples come in the form of a list of lists, where each inner list contains elements to prefill the `inputs` parameter with
    examples=[
        ["examples/trashify_example_1.jpeg", 0.25],
        ["examples/trashify_example_2.jpeg", 0.25],
        ["examples/trashify_example_3.jpeg", 0.25]
    ],
    cache_examples=True
)

# Launch the demo
demo.launch()
Overwriting demos/trashify_object_detector_data_aug_model/app.py
# 1. Import the required methods for uploading to the Hugging Face Hub
from huggingface_hub import (
    create_repo,
    get_full_repo_name,
    upload_file,   # for uploading a single file (if necessary)
    upload_folder  # for uploading multiple files (in a folder)
)

# 2. Define the parameters we'd like to use for the upload
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "demos/trashify_object_detector_data_aug_model" # TK - update this path
HF_TARGET_SPACE_NAME = "trashify_demo_v2"
HF_REPO_TYPE = "space" # we're creating a Hugging Face Space
HF_SPACE_SDK = "gradio"
HF_TOKEN = "" # optional: set to your Hugging Face token (but I'd advise storing this as an environment variable as previously discussed)

# 3. Create a Space repository on Hugging Face Hub
print(f"[INFO] Creating repo on Hugging Face Hub with name: {HF_TARGET_SPACE_NAME}")
create_repo(
    repo_id=HF_TARGET_SPACE_NAME,
    # token=HF_TOKEN, # optional: set token manually (though it will be automatically recognized if it's available as an environment variable)
    repo_type=HF_REPO_TYPE,
    private=False, # set to True if you don't want your Space to be accessible to others
    space_sdk=HF_SPACE_SDK,
    exist_ok=True, # set to False if you want an error to raise if the repo_id already exists
)

# 4. Get the full repository name (e.g. {username}/{model_id} or {username}/{space_name})
full_hf_repo_name = get_full_repo_name(model_id=HF_TARGET_SPACE_NAME)
print(f"[INFO] Full Hugging Face Hub repo name: {full_hf_repo_name}")

# 5. Upload our demo folder
print(f"[INFO] Uploading {LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD} to repo: {full_hf_repo_name}")
folder_upload_url = upload_folder(
    repo_id=full_hf_repo_name,
    folder_path=LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
    path_in_repo=".", # upload our folder to the root directory ("." means "base" or "root", this is the default)
    # token=HF_TOKEN, # optional: set token manually
    repo_type=HF_REPO_TYPE,
    commit_message="Uploading Trashify V2 box detection model (with data augmentation) app.py"
)
print(f"[INFO] Demo folder successfully uploaded with commit URL: {folder_upload_url}")
[INFO] Creating repo on Hugging Face Hub with name: trashify_demo_v2
[INFO] Full Hugging Face Hub repo name: mrdbourke/trashify_demo_v2
[INFO] Uploading demos/trashify_object_detector_data_aug_model to repo: mrdbourke/trashify_demo_v2
[INFO] Demo folder successfully uploaded with commit URL: https://huggingface.co/spaces/mrdbourke/trashify_demo_v2/tree/main/.
# Next:
# Upload augmentation model to Hugging Face Hub ✅
# Create demo for augmentation model ✅
# Compare results from augmentation model to non-augmentation model ✅
21.1 TK - Make a prediction on a random test sample with the data augmentation model
# Get a random sample from the test preds
random_test_pred_index = random.randint(0, len(processed_dataset["test"]) - 1)
print(f"[INFO] Making predictions on test item with index: {random_test_pred_index}")
random_test_sample = processed_dataset["test"][random_test_pred_index]

# Do a single forward pass with the model
random_test_sample_outputs = model_aug(pixel_values=random_test_sample["pixel_values"].unsqueeze(0).to("cuda"), # model expects input [batch_size, color_channels, height, width]
                                       pixel_mask=None)

# Post process a random item from test preds
random_test_sample_outputs_post_processed = image_processor.post_process_object_detection(
    outputs=random_test_sample_outputs,
    threshold=0.25, # prediction probability threshold for boxes (note: boxes from an untrained model will likely be bad)
    target_sizes=[random_test_sample["labels"]["orig_size"]] # original input image size (or whichever target size you'd like), required to be the same number of items as the input list
)

# Plot the random sample test preds
# Extract scores, labels and boxes
random_test_sample_pred_scores = random_test_sample_outputs_post_processed[0]["scores"]
random_test_sample_pred_labels = random_test_sample_outputs_post_processed[0]["labels"]
random_test_sample_pred_boxes = random_test_sample_outputs_post_processed[0]["boxes"]

# Create a list of labels to plot on the boxes
random_test_sample_labels_to_plot = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                     for label_pred, score_pred in zip(random_test_sample_pred_labels, random_test_sample_pred_scores)]

print(f"[INFO] Labels with scores: {random_test_sample_labels_to_plot}")
print(f"[INFO] Boxes:")
for item in random_test_sample_pred_boxes:
    print(item.detach().cpu())
print(f"[INFO] Total preds: {len(random_test_sample_labels_to_plot)}")

# Plot the predicted boxes on the random test image
to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
        boxes=random_test_sample_pred_boxes,
        labels=random_test_sample_labels_to_plot,
        width=3
    )
)
[INFO] Making predictions on test item with index: 163
[INFO] Labels with scores: ['Pred: bin (0.6625)', 'Pred: hand (0.5412)', 'Pred: trash (0.5007)', 'Pred: trash (0.4147)', 'Pred: trash (0.396)', 'Pred: not_trash (0.3237)', 'Pred: hand (0.2799)']
[INFO] Boxes:
tensor([ 10.7812, 393.1250, 950.1562, 1160.6250])
tensor([ 149.8828, 667.9688, 471.6797, 1018.2812])
tensor([405.0000, 679.1406, 668.4375, 972.1094])
tensor([248.2031, 472.6562, 675.7031, 994.8438])
tensor([ 140.6250, 467.3438, 675.9375, 1002.6562])
tensor([ 373.2422, 896.4844, 648.6328, 1063.5156])
tensor([ 10.3125, 667.9688, 472.0312, 1264.5312])
[INFO] Total preds: 7
22 TK - Model V3 - Cleaning up predictions with NMS (Non-max Suppression)
UPTOHERE
- Take preds from model v2 and perform NMS on them to see what happens
- Need to calculate:
  - IoU (intersection over union)
  - Can write about these in a blog post as extension material
- Test image indexes good to practice on: 163, 108
- Create a demo which compares NMS-free boxes to boxes with NMS
22.1 TK - NMS filtering logic to do
TK - create a table of different items here
- Simplest filtering: keep only one box per class label, the one with the highest score per image (e.g. if there are two "hand" predictions, keep only the higher scoring one) ✅
- TK - The problem with simple filtering is that it might remove a box that would've been helpful, and it assumes there are few false positives (e.g. that each box is predicting the class it should predict)
- Greedy IoU filtering: Filter box pairs which have IoU > 0.9 (big overlap) and keep the box with the higher score ✅
- TK - The problem here is that it may filter heavily overlapping classes (e.g. if many boxes of different classes are clustered together because the objects themselves overlap, such as items on a plate of food)
- Class-aware IoU filtering: Filter boxes which have the same label and IoU > 0.5 and keep the box with the higher score

Other potential NMS options (see the sketch below for torchvision's built-in versions of these):
- Greedy NMS (good for distinct boxes, just take the highest scoring box per class)
- Soft-NMS with linear penalty (good for boxes which may have overlap, e.g. smaller boxes in clusters)
- Class-aware NMS (only perform NMS on boxes of the same class)
See this video here: https://youtu.be/VAo84c1hQX8?si=dYftsYADb9Kq-bul
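Rather than writing every filtering step by hand, torchvision also ships greedy and class-aware NMS as built-in ops. The snippet below is a minimal sketch with made-up boxes (not from our model, and these ops aren't used elsewhere in this notebook) showing `torchvision.ops.nms` and `torchvision.ops.batched_nms`:

import torch
from torchvision.ops import nms, batched_nms

# Example boxes in XYXY format with their scores and class labels (made-up values for illustration)
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 98., 102.],    # heavily overlaps box 0
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])
labels = torch.tensor([0, 0, 1])

# Greedy NMS across all boxes regardless of class
keep_all = nms(boxes=boxes, scores=scores, iou_threshold=0.5)

# Class-aware NMS: boxes are only suppressed by higher-scoring boxes of the same class
keep_per_class = batched_nms(boxes=boxes, scores=scores, idxs=labels, iou_threshold=0.5)

print(keep_all)        # indices of boxes to keep across all classes
print(keep_per_class)  # indices of boxes to keep per class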
TK - show a prediction with more boxes than ideal, then introduce NMS as a technique to fix the predictions (e.g. on the same sample)
- TK - NMS doesn't need an extra model, just a way to compare and filter the predicted boxes (e.g. using their scores and IoU)
TK - test index 163 is a good example with many boxes that could be reduced to a few
22.2 TK - Simple NMS - Keep only the highest scoring box per class
TK - This is the simplest method: iterate through the boxes and keep the highest scoring box per class (e.g. if there are two "hand" prediction boxes, keep only the higher scoring one).
def filter_highest_scoring_box_per_class(boxes, labels, scores):
    """
    Perform NMS (Non-max Suppression) to only keep the top scoring box per class.

    Args:
        boxes: tensor of shape (N, 4)
        labels: tensor of shape (N,)
        scores: tensor of shape (N,)

    Returns:
        boxes: tensor of shape (N, 4) filtered for max scoring item per class
        labels: tensor of shape (N,) filtered for max scoring item per class
        scores: tensor of shape (N,) filtered for max scoring item per class
    """
    # Start with a blank keep mask (e.g. all False and then update the boxes to keep with True)
    keep_mask = torch.zeros(len(boxes), dtype=torch.bool)

    # For each unique class
    for class_id in labels.unique():
        # Get the indices for the target class
        class_mask = labels == class_id

        # If any of the labels match the current class_id
        if class_mask.any():
            # Find the index of the highest scoring box for this specific class
            class_scores = scores[class_mask]
            highest_score_idx = class_scores.argmax()

            # Convert back to the original index
            original_idx = torch.where(class_mask)[0][highest_score_idx]

            # Update the index in the keep mask to keep the highest scoring box
            keep_mask[original_idx] = True

    return boxes[keep_mask], labels[keep_mask], scores[keep_mask]
# Mask with simple NMS keep mask
keep_boxes, keep_labels, keep_scores = filter_highest_scoring_box_per_class(boxes=random_test_sample_pred_boxes,
                                                                            labels=random_test_sample_pred_labels,
                                                                            scores=random_test_sample_pred_scores)

print(len(random_test_sample_pred_boxes), len(random_test_sample_pred_labels), len(random_test_sample_pred_scores))
print(len(keep_scores), len(keep_labels), len(keep_boxes))
7 7 7
4 4 4
keep_boxes, keep_labels, keep_scores
(tensor([[ 10.7812, 393.1250, 950.1562, 1160.6250],
[ 149.8828, 667.9688, 471.6797, 1018.2812],
[ 405.0000, 679.1406, 668.4375, 972.1094],
[ 373.2422, 896.4844, 648.6328, 1063.5156]], device='cuda:0',
grad_fn=<IndexBackward0>),
tensor([0, 1, 5, 4], device='cuda:0'),
tensor([0.6625, 0.5412, 0.5007, 0.3237], device='cuda:0',
grad_fn=<IndexBackward0>))
# Create a list of labels to plot on the boxes
random_test_sample_labels_to_plot = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                     for label_pred, score_pred in zip(random_test_sample_pred_labels, random_test_sample_pred_scores)]
print(f"[INFO] Labels with scores: {random_test_sample_labels_to_plot}")

# Plot the predicted boxes on the random test image
test_image_with_preds_original = to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
        boxes=random_test_sample_pred_boxes,
        labels=random_test_sample_labels_to_plot,
        width=3
    )
)

### Create image with filtered boxes

# Create a list of labels to plot on the boxes
random_test_sample_labels_to_plot_filtered = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                              for label_pred, score_pred in zip(keep_labels, keep_scores)]
print(f"[INFO] Labels with scores: {random_test_sample_labels_to_plot_filtered}")

# Plot the predicted boxes on the random test image
test_image_with_preds_filtered = to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
        boxes=keep_boxes,
        labels=random_test_sample_labels_to_plot_filtered,
        width=3
    )
)

# Visualize the transformed image
import matplotlib.pyplot as plt

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

# Display image 1
axes[0].imshow(test_image_with_preds_original)
axes[0].axis("off") # Hide axes
axes[0].set_title(f"Original Image Preds (total: {len(random_test_sample_pred_boxes)})")

# Display image 2
axes[1].imshow(test_image_with_preds_filtered)
axes[1].axis("off") # Hide axes
axes[1].set_title(f"Filtered Image Preds (total: {len(keep_boxes)})")

# Show the plot
plt.suptitle("Simple NMS - Only keep the highest scoring box per prediction")
plt.tight_layout()
plt.show()
[INFO] Labels with scores: ['Pred: bin (0.6625)', 'Pred: hand (0.5412)', 'Pred: trash (0.5007)', 'Pred: trash (0.4147)', 'Pred: trash (0.396)', 'Pred: not_trash (0.3237)', 'Pred: hand (0.2799)']
[INFO] Labels with scores: ['Pred: bin (0.6625)', 'Pred: hand (0.5412)', 'Pred: trash (0.5007)', 'Pred: not_trash (0.3237)']
TK - The problem with simple filtering is that it might remove a box that would've been helpful, and it assumes there are few false positives (e.g. that each box is predicting the class it should predict).
22.3 TK - Greedy IoU Filtering - Intersection over Union - If a pair of boxes have an IoU over a certain threshold, keep the box with the higher score
- IoU in torchmetrics - https://lightning.ai/docs/torchmetrics/stable/detection/intersection_over_union.html
To calculate the Intersection over Union (IoU) between two bounding boxes:
- Coordinates of the intersection rectangle: \[ x_{\text{left}} = \max(x_{1A}, x_{1B}) \] \[ y_{\text{top}} = \max(y_{1A}, y_{1B}) \] \[ x_{\text{right}} = \min(x_{2A}, x_{2B}) \] \[ y_{\text{bottom}} = \min(y_{2A}, y_{2B}) \]
Where:
\[ \text{A} = \text{Box 1} \] \[ \text{B} = \text{Box 2} \]
Width and height of the intersection: \[ \text{intersection\_width} = \max(0, x_{\text{right}} - x_{\text{left}}) \] \[ \text{intersection\_height} = \max(0, y_{\text{bottom}} - y_{\text{top}}) \]
Area of Overlap: \[ \text{Area of Overlap} = \text{intersection\_width} \times \text{intersection\_height} \]
Area of Union: \[ \text{Area of Union} = \text{Area of Box 1} + \text{Area of Box 2} - \text{Area of Overlap} \]
Intersection over Union (IoU): \[ \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \]
# IoU = Intersection / Union
# Intersection =
# x_left = max(x1_A, x1_B)
# y_top = max(y1_A, y1_B)
# x_right = min(x2_A, x2_B)
# y_bottom = min(y2_A, y2_B)
#
# Where:
# A = Box 1
# B = Box 2
# intersection_width = max(0, x_right - x_left)
# intersection_height = max(0, y_bottom - y_top)
# area_intersection = intersection_width * intersection_height
# Union = area_box_1 + area_box_2 - intersection
def intersection_over_union_score(box_1, box_2):
    """Calculates Intersection over Union (IoU) score for two given boxes in XYXY format."""
    assert len(box_1) == 4, f"Box 1 should have four elements in the format [x_1, y_1, x_2, y_2] but has: {len(box_1)}, see: {box_1}"
    assert len(box_2) == 4, f"Box 2 should have four elements in the format [x_1, y_1, x_2, y_2] but has: {len(box_2)}, see: {box_2}"

    x1_box_1, y1_box_1, x2_box_1, y2_box_1 = box_1[0], box_1[1], box_1[2], box_1[3]
    x1_box_2, y1_box_2, x2_box_2, y2_box_2 = box_2[0], box_2[1], box_2[2], box_2[3]

    # Get coordinates of the overlapping box (note: there may not be any overlapping box)
    x_left = torch.max(x1_box_1, x1_box_2)
    y_top = torch.max(y1_box_1, y1_box_2)
    x_right = torch.min(x2_box_1, x2_box_2)
    y_bottom = torch.min(y2_box_1, y2_box_2)

    # Calculate the intersection width and height (we take the max of 0 and the value to handle non-overlapping boxes)
    intersection_width = max(0, x_right - x_left)
    intersection_height = max(0, y_bottom - y_top)

    # Calculate the area of intersection (note: this will be 0 if either width or height are 0)
    area_of_intersection = intersection_height * intersection_width

    # Calculate individual box areas
    box_1_area = (x2_box_1 - x1_box_1) * (y2_box_1 - y1_box_1) # width * height
    box_2_area = (x2_box_2 - x1_box_2) * (y2_box_2 - y1_box_2)

    # Calculate area of union (sum of box areas minus the intersection area)
    area_of_union = box_1_area + box_2_area - area_of_intersection

    # Calculate the IoU score
    iou_score = area_of_intersection / area_of_union

    return iou_score

iou_score_test_pred_boxes = intersection_over_union_score(box_1=random_test_sample_pred_boxes[4],
                                                          box_2=random_test_sample_pred_boxes[3])

print(f"[INFO] IoU Score: {iou_score_test_pred_boxes}")

random_test_sample_pred_boxes[0], random_test_sample_pred_boxes[1]
[INFO] IoU Score: 0.7790185809135437
(tensor([ 10.7812, 393.1250, 950.1562, 1160.6250], device='cuda:0',
grad_fn=<SelectBackward0>),
tensor([ 149.8828, 667.9688, 471.6797, 1018.2812], device='cuda:0',
grad_fn=<SelectBackward0>))
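As a quick cross-check (a minimal sketch, not part of the original workflow), torchvision's built-in `box_iou` should give roughly the same value for the same pair of boxes as our manual function:

from torchvision.ops import box_iou

# box_iou expects batches of boxes in XYXY format and returns an (N, M) matrix of pairwise IoU scores
iou_matrix = box_iou(random_test_sample_pred_boxes[4].unsqueeze(0),
                     random_test_sample_pred_boxes[3].unsqueeze(0))
print(f"[INFO] torchvision box_iou score: {iou_matrix.item()}")  # should roughly match our manual IoU (~0.779)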
# TK - for visualization purposes, write code to highlight the intersecting points on a box and print the IoU score in the middle of the box
# IoU logic
# 1. General IoU threshold (removing boxes at a global level, regardless of label)
#    -> for box pairs with IoU > 0.9, keep the higher scoring box
# 2. Label specific IoU threshold (only concern is comparing boxes with the same label)
#    -> for box pairs with the same label and IoU > 0.5, keep the higher scoring box
keep_boxes = []
keep_scores = []
keep_labels = []

random_test_sample_pred_scores = random_test_sample_outputs_post_processed[0]["scores"]
random_test_sample_pred_labels = random_test_sample_outputs_post_processed[0]["labels"]
random_test_sample_pred_boxes = random_test_sample_outputs_post_processed[0]["boxes"]

keep_indexes = torch.ones(len(random_test_sample_pred_boxes), dtype=torch.bool)

iou_general_threshold = 0.9 # general threshold = remove the lower scoring box in box pairs over iou_general_threshold regardless of the label
iou_class_level_threshold = 0.5 # remove overlapping boxes of the same class

# TODO: Add a clause here to include if class labels are the same, then filter based on the class-specific IoU threshold
filter_global = True
filter_same_label = True

# Count the total loops
total_loops = 0

for i, box_A in enumerate(random_test_sample_pred_boxes):
    if not keep_indexes[i]: # insert clause to prevent calculating on already filtered boxes
        continue
    for j, box_B in enumerate(random_test_sample_pred_boxes):
        if not keep_indexes[i]:
            continue
        # Only calculate IoU score if indexes aren't the same (saves comparing the same index boxes for unwanted calculations)
        if (i != j):
            iou_score = intersection_over_union_score(box_1=box_A, box_2=box_B)
            print(f"[INFO] IoU Score for box {(i, j)}: {iou_score}")

            if filter_global:
                if iou_score > iou_general_threshold:
                    score_A, score_B = random_test_sample_pred_scores[i], random_test_sample_pred_scores[j]
                    if score_A > score_B:
                        print(f"[INFO] Box to keep index: {i} -> {box_A}")
                        keep_indexes[j] = False
                    else:
                        print(f"[INFO] Box to keep index: {j} -> {box_B}")
                        keep_indexes[i] = False

            if filter_same_label:
                if iou_score > iou_class_level_threshold:
                    i_label = random_test_sample_pred_labels[i]
                    j_label = random_test_sample_pred_labels[j]
                    if i_label == j_label:
                        print(f"Labels are equal: {i_label, j_label}")
                        score_A, score_B = random_test_sample_pred_scores[i], random_test_sample_pred_scores[j]
                        if score_A > score_B:
                            print(f"[INFO] Box to keep index: {i} -> {box_A}")
                            keep_indexes[j] = False
                        else:
                            print(f"[INFO] Box to keep index: {j} -> {box_B}")
                            keep_indexes[i] = False

        total_loops += 1

print(keep_indexes)

keep_scores = random_test_sample_pred_scores[keep_indexes]
keep_labels = random_test_sample_pred_labels[keep_indexes]
keep_boxes = random_test_sample_pred_boxes[keep_indexes]

print(len(random_test_sample_pred_boxes), len(random_test_sample_pred_labels), len(random_test_sample_pred_scores))
print(len(keep_scores), len(keep_labels), len(keep_boxes), sum(keep_indexes))
print(f"[INFO] Number of total loops: {total_loops}, max possible loops: {len(random_test_sample_pred_boxes)**2}")
[INFO] IoU Score for box (0, 1): 0.156358003616333
[INFO] IoU Score for box (0, 2): 0.10704872757196426
[INFO] IoU Score for box (0, 3): 0.3096315264701843
[INFO] IoU Score for box (0, 4): 0.3974636495113373
[INFO] IoU Score for box (0, 5): 0.06380129605531693
[INFO] IoU Score for box (0, 6): 0.2954297661781311
[INFO] IoU Score for box (1, 0): 0.156358003616333
[INFO] IoU Score for box (1, 2): 0.11466032266616821
[INFO] IoU Score for box (1, 3): 0.2778415083885193
[INFO] IoU Score for box (1, 4): 0.36936208605766296
[INFO] IoU Score for box (1, 5): 0.08170551061630249
[INFO] IoU Score for box (1, 6): 0.4092644155025482
[INFO] IoU Score for box (2, 0): 0.10704872757196426
[INFO] IoU Score for box (2, 1): 0.11466032266616821
[INFO] IoU Score for box (2, 3): 0.34572935104370117
[INFO] IoU Score for box (2, 4): 0.26932957768440247
[INFO] IoU Score for box (2, 5): 0.17588727176189423
[INFO] IoU Score for box (2, 6): 0.058975815773010254
[INFO] IoU Score for box (3, 0): 0.3096315264701843
[INFO] IoU Score for box (3, 1): 0.2778415083885193
[INFO] IoU Score for box (3, 2): 0.34572935104370117
[INFO] IoU Score for box (3, 4): 0.7790185809135437
Labels are equal: (tensor(5, device='cuda:0'), tensor(5, device='cuda:0'))
[INFO] Box to keep index: 3 -> tensor([248.2031, 472.6562, 675.7031, 994.8438], device='cuda:0',
grad_fn=<UnbindBackward0>)
[INFO] IoU Score for box (3, 5): 0.11186295002698898
[INFO] IoU Score for box (3, 6): 0.1719416379928589
[INFO] IoU Score for box (5, 0): 0.06380129605531693
[INFO] IoU Score for box (5, 1): 0.08170551061630249
[INFO] IoU Score for box (5, 2): 0.17588727176189423
[INFO] IoU Score for box (5, 3): 0.11186295002698898
[INFO] IoU Score for box (5, 4): 0.0963958203792572
[INFO] IoU Score for box (5, 6): 0.05411146208643913
[INFO] IoU Score for box (6, 0): 0.2954297661781311
[INFO] IoU Score for box (6, 1): 0.4092644155025482
[INFO] IoU Score for box (6, 2): 0.058975815773010254
[INFO] IoU Score for box (6, 3): 0.1719416379928589
[INFO] IoU Score for box (6, 4): 0.24588997662067413
[INFO] IoU Score for box (6, 5): 0.05411146208643913
tensor([ True, True, True, True, False, True, True])
7 7 7
6 6 6 tensor(6)
[INFO] Number of total loops: 42, max possible loops: 49
# Create a list of labels to plot on the boxes
random_test_sample_labels_to_plot = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                     for label_pred, score_pred in zip(random_test_sample_pred_labels, random_test_sample_pred_scores)]
print(f"[INFO] Labels with scores: {random_test_sample_labels_to_plot}")

# Plot the predicted boxes on the random test image
test_image_with_preds_original = to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
        boxes=random_test_sample_pred_boxes,
        labels=random_test_sample_labels_to_plot,
        width=3
    )
)

### Create image with filtered boxes

# Create a list of labels to plot on the boxes
random_test_sample_labels_to_plot_filtered = [f"Pred: {id2label[label_pred.item()]} ({round(score_pred.item(), 4)})"
                                              for label_pred, score_pred in zip(keep_labels, keep_scores)]
print(f"[INFO] Labels with scores: {random_test_sample_labels_to_plot_filtered}")

# Plot the predicted boxes on the random test image
test_image_with_preds_filtered = to_pil_image(
    pic=draw_bounding_boxes(
        image=pil_to_tensor(pic=dataset["test"][random_test_pred_index]["image"]),
        boxes=keep_boxes,
        labels=random_test_sample_labels_to_plot_filtered,
        width=3
    )
)

# Visualize the transformed image
import matplotlib.pyplot as plt

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

# Display image 1
axes[0].imshow(test_image_with_preds_original)
axes[0].axis("off") # Hide axes
axes[0].set_title(f"Original Image Preds (total: {len(random_test_sample_pred_boxes)})")

# Display image 2
axes[1].imshow(test_image_with_preds_filtered)
axes[1].axis("off") # Hide axes
axes[1].set_title(f"Filtered Image Preds (total: {len(keep_boxes)})")

# Show the plot
plt.suptitle(f"Greedy IoU Filtering (General) - For boxes with IoU > {iou_general_threshold}, keep the higher scoring box")
plt.tight_layout()
plt.show()
[INFO] Labels with scores: ['Pred: bin (0.6625)', 'Pred: hand (0.5412)', 'Pred: trash (0.5007)', 'Pred: trash (0.4147)', 'Pred: trash (0.396)', 'Pred: not_trash (0.3237)', 'Pred: hand (0.2799)']
[INFO] Labels with scores: ['Pred: bin (0.6625)', 'Pred: hand (0.5412)', 'Pred: trash (0.5007)', 'Pred: trash (0.4147)', 'Pred: not_trash (0.3237)', 'Pred: hand (0.2799)']
# TK - more NMS logic:
# If there is more than one hand, keep the one with the highest score...
23 TK - Create a Demo with Simple NMS Filtering (only keep the highest scoring box per class)
UPTOHERE:
- upload the demo to Hugging Face Spaces as Trashify V3
- Make sure the demo works
- Go back through the code and start tidying up/explaining things
- Create a blog post to discuss different box formats in object detection
- Create a blog post for NMS + IoU filtering (can create an IoU function that colours in the intersection parts)
- Create an extension for longer training + synthetic data + evaluation metrics + deploying on transformers.js
# Make directory for demo
from pathlib import Path

trashify_data_aug_model_dir = Path("demos/trashify_object_detector_data_aug_model_with_nms/")
trashify_data_aug_model_dir.mkdir(exist_ok=True)
%%writefile demos/trashify_object_detector_data_aug_model_with_nms/requirements.txt
gradio
timm
torch
transformers
Overwriting demos/trashify_object_detector_data_aug_model_with_nms/requirements.txt
%%writefile demos/trashify_object_detector_data_aug_model_with_nms/README.md
---
title: Trashify Demo V3 🚮
emoji: 🗑️
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 4.40.0
app_file: app.py
pinned: false
license: apache-2.0
---

# 🚮 Trashify Object Detector Demo V3

Object detection demo to detect `trash`, `bin`, `hand`, `trash_arm`, `not_trash`, `not_bin`, `not_hand`.

Used as an example to encourage people to clean up their local area.

If `trash`, `hand`, `bin` all detected = +1 point.

## Dataset

All Trashify models are trained on a custom hand-labelled dataset of people picking up trash and placing it in a bin.

The dataset can be found on Hugging Face as [`mrdbourke/trashify_manual_labelled_images`](https://huggingface.co/datasets/mrdbourke/trashify_manual_labelled_images).

## Demos

* [V1](https://huggingface.co/spaces/mrdbourke/trashify_demo_v1) = Fine-tuned DETR model trained *without* data augmentation.
* [V2](https://huggingface.co/spaces/mrdbourke/trashify_demo_v2) = Fine-tuned DETR model trained *with* data augmentation.
* [V3](https://huggingface.co/spaces/mrdbourke/trashify_demo_v3) = Fine-tuned DETR model trained *with* data augmentation (same as V2) with an NMS (Non Maximum Suppression) post-processing step.

TK - finish the README.md + update with links to materials
Overwriting demos/trashify_object_detector_data_aug_model_with_nms/README.md
%%writefile demos/trashify_object_detector_data_aug_model_with_nms/app.py
import gradio as gr
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoImageProcessor
from transformers import AutoModelForObjectDetection

# Note: Can load from Hugging Face or can load from local.
# You will have to replace {mrdbourke} for your own username if the model is on your Hugging Face account.
model_save_path = "mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug"

# Load the model and preprocessor
image_processor = AutoImageProcessor.from_pretrained(model_save_path)
model = AutoModelForObjectDetection.from_pretrained(model_save_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Get the id2label dictionary from the model
id2label = model.config.id2label

# Set up a colour dictionary for plotting boxes with different colours
color_dict = {
    "bin": "green",
    "trash": "blue",
    "hand": "purple",
    "trash_arm": "yellow",
    "not_trash": "red",
    "not_bin": "red",
    "not_hand": "red",
}

# Create helper functions for seeing if items from one list are in another
def any_in_list(list_a, list_b):
    "Returns True if any item from list_a is in list_b, otherwise False."
    return any(item in list_b for item in list_a)

def all_in_list(list_a, list_b):
    "Returns True if all items from list_a are in list_b, otherwise False."
    return all(item in list_b for item in list_a)

def filter_highest_scoring_box_per_class(boxes, labels, scores):
    """
    Perform NMS (Non-max Suppression) to only keep the top scoring box per class.

    Args:
        boxes: tensor of shape (N, 4)
        labels: tensor of shape (N,)
        scores: tensor of shape (N,)

    Returns:
        boxes: tensor of shape (N, 4) filtered for max scoring item per class
        labels: tensor of shape (N,) filtered for max scoring item per class
        scores: tensor of shape (N,) filtered for max scoring item per class
    """
    # Start with a blank keep mask (e.g. all False and then update the boxes to keep with True)
    keep_mask = torch.zeros(len(boxes), dtype=torch.bool)

    # For each unique class
    for class_id in labels.unique():
        # Get the indices for the target class
        class_mask = labels == class_id

        # If any of the labels match the current class_id
        if class_mask.any():
            # Find the index of the highest scoring box for this specific class
            class_scores = scores[class_mask]
            highest_score_idx = class_scores.argmax()

            # Convert back to the original index
            original_idx = torch.where(class_mask)[0][highest_score_idx]

            # Update the index in the keep mask to keep the highest scoring box
            keep_mask[original_idx] = True

    return boxes[keep_mask], labels[keep_mask], scores[keep_mask]

def create_return_string(list_of_predicted_labels, target_items=["trash", "bin", "hand"]):
    # Setup blank string to print out
    return_string = ""

    # If no items detected or trash, bin, hand not in list, return notification
    if (len(list_of_predicted_labels) == 0) or not (any_in_list(list_a=target_items, list_b=list_of_predicted_labels)):
        return_string = f"No trash, bin or hand detected at confidence threshold {conf_threshold}. Try another image or lowering the confidence threshold."
        return return_string

    # If there are some missing, print the ones which are missing
    elif not all_in_list(list_a=target_items, list_b=list_of_predicted_labels):
        missing_items = []
        for item in target_items:
            if item not in list_of_predicted_labels:
                missing_items.append(item)
        return_string = f"Detected the following items: {list_of_predicted_labels} (total: {len(list_of_predicted_labels)}). But missing the following in order to get +1: {missing_items}. If this is an error, try another image or altering the confidence threshold. Otherwise, the model may need to be updated with better data."

    # If all 3 trash, bin, hand occur = +1
    if all_in_list(list_a=target_items, list_b=list_of_predicted_labels):
        return_string = f"+1! Found the following items: {list_of_predicted_labels} (total: {len(list_of_predicted_labels)}), thank you for cleaning up the area!"

    print(return_string)

    return return_string

def predict_on_image(image, conf_threshold):
    with torch.no_grad():
        inputs = image_processor(images=[image], return_tensors="pt")
        outputs = model(**inputs.to(device))

        target_sizes = torch.tensor([[image.size[1], image.size[0]]]) # height, width

        results = image_processor.post_process_object_detection(outputs,
                                                                 threshold=conf_threshold,
                                                                 target_sizes=target_sizes)[0]

    # Return all items in results to CPU
    for key, value in results.items():
        try:
            results[key] = value.item().cpu() # can't get scalar as .item() so add try/except block
        except:
            results[key] = value.cpu()

    # Can return results as plotted on a PIL image (then display the image)
    draw = ImageDraw.Draw(image)

    # Create a copy of the image to draw on it for NMS
    image_nms = image.copy()
    draw_nms = ImageDraw.Draw(image_nms)

    # Get a font from ImageFont
    font = ImageFont.load_default(size=20)

    # Get class names as text for print out
    class_name_text_labels = []

    # TK - update this for NMS
    class_name_text_labels_nms = []

    # Get original boxes, scores, labels
    original_boxes = results["boxes"]
    original_labels = results["labels"]
    original_scores = results["scores"]

    # Filter boxes and only keep 1x of each label with the highest score
    filtered_boxes, filtered_labels, filtered_scores = filter_highest_scoring_box_per_class(boxes=original_boxes,
                                                                                            labels=original_labels,
                                                                                            scores=original_scores)

    # TODO: turn this into a function so it's cleaner?
    for box, label, score in zip(original_boxes, original_labels, original_scores):
        # Create coordinates
        x, y, x2, y2 = tuple(box.tolist())

        # Get label_name
        label_name = id2label[label.item()]
        targ_color = color_dict[label_name]

        class_name_text_labels.append(label_name)

        # Draw the rectangle
        draw.rectangle(xy=(x, y, x2, y2),
                       outline=targ_color,
                       width=3)

        # Create a text string to display
        text_string_to_show = f"{label_name} ({round(score.item(), 3)})"

        # Draw the text on the image
        draw.text(xy=(x, y),
                  text=text_string_to_show,
                  fill="white",
                  font=font)

    # TODO: turn this into a function so it's cleaner?
    for box, label, score in zip(filtered_boxes, filtered_labels, filtered_scores):
        # Create coordinates
        x, y, x2, y2 = tuple(box.tolist())

        # Get label_name
        label_name = id2label[label.item()]
        targ_color = color_dict[label_name]

        class_name_text_labels_nms.append(label_name)

        # Draw the rectangle
        draw_nms.rectangle(xy=(x, y, x2, y2),
                           outline=targ_color,
                           width=3)

        # Create a text string to display
        text_string_to_show = f"{label_name} ({round(score.item(), 3)})"

        # Draw the text on the image
        draw_nms.text(xy=(x, y),
                      text=text_string_to_show,
                      fill="white",
                      font=font)

    # Remove the draw each time
    del draw
    del draw_nms

    # Create the return strings
    return_string = create_return_string(list_of_predicted_labels=class_name_text_labels)
    return_string_nms = create_return_string(list_of_predicted_labels=class_name_text_labels_nms)

    return image, return_string, image_nms, return_string_nms

# Create the interface
demo = gr.Interface(
    fn=predict_on_image,
    inputs=[
        gr.Image(type="pil", label="Target Image"),
        gr.Slider(minimum=0, maximum=1, value=0.25, label="Confidence Threshold")
    ],
    outputs=[
        gr.Image(type="pil", label="Image Output (no filtering)"),
        gr.Text(label="Text Output (no filtering)"),
        gr.Image(type="pil", label="Image Output (with max score per class box filtering)"),
        gr.Text(label="Text Output (with max score per class box filtering)")
    ],
    title="🚮 Trashify Object Detection Demo V3",
    description="""Help clean up your local area! Upload an image and get +1 if there is all of the following items detected: trash, bin, hand.

    The model in V3 is the [same model](https://huggingface.co/mrdbourke/detr_finetuned_trashify_box_detector_with_data_aug) as in [V2](https://huggingface.co/spaces/mrdbourke/trashify_demo_v2) (trained with data augmentation) but has an additional post-processing step (NMS or [Non Maximum Suppression](https://paperswithcode.com/method/non-maximum-suppression)) to keep only the highest scoring box of each class.
    """,
    # Examples come in the form of a list of lists, where each inner list contains elements to prefill the `inputs` parameter with
    examples=[
        ["examples/trashify_example_1.jpeg", 0.25],
        ["examples/trashify_example_2.jpeg", 0.25],
        ["examples/trashify_example_3.jpeg", 0.25]
    ],
    cache_examples=True
)

# Launch the demo
demo.launch()
Overwriting demos/trashify_object_detector_data_aug_model_with_nms/app.py
23.1 TK - Upload our demo to the Hugging Face Hub
# 1. Import the required methods for uploading to the Hugging Face Hub
from huggingface_hub import (
    create_repo,
    get_full_repo_name,
    upload_file,   # for uploading a single file (if necessary)
    upload_folder  # for uploading multiple files (in a folder)
)

# 2. Define the parameters we'd like to use for the upload
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "demos/trashify_object_detector_data_aug_model_with_nms" # TK - update this path
HF_TARGET_SPACE_NAME = "trashify_demo_v3"
HF_REPO_TYPE = "space" # we're creating a Hugging Face Space
HF_SPACE_SDK = "gradio"
HF_TOKEN = "" # optional: set to your Hugging Face token (but I'd advise storing this as an environment variable as previously discussed)

# 3. Create a Space repository on Hugging Face Hub
print(f"[INFO] Creating repo on Hugging Face Hub with name: {HF_TARGET_SPACE_NAME}")
create_repo(
    repo_id=HF_TARGET_SPACE_NAME,
    # token=HF_TOKEN, # optional: set token manually (though it will be automatically recognized if it's available as an environment variable)
    repo_type=HF_REPO_TYPE,
    private=False, # set to True if you don't want your Space to be accessible to others
    space_sdk=HF_SPACE_SDK,
    exist_ok=True, # set to False if you want an error to raise if the repo_id already exists
)

# 4. Get the full repository name (e.g. {username}/{model_id} or {username}/{space_name})
full_hf_repo_name = get_full_repo_name(model_id=HF_TARGET_SPACE_NAME)
print(f"[INFO] Full Hugging Face Hub repo name: {full_hf_repo_name}")

# 5. Upload our demo folder
print(f"[INFO] Uploading {LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD} to repo: {full_hf_repo_name}")
folder_upload_url = upload_folder(
    repo_id=full_hf_repo_name,
    folder_path=LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
    path_in_repo=".", # upload our folder to the root directory ("." means "base" or "root", this is the default)
    # token=HF_TOKEN, # optional: set token manually
    repo_type=HF_REPO_TYPE,
    commit_message="Uploading Trashify box detection model v3 app.py with NMS post processing"
)
print(f"[INFO] Demo folder successfully uploaded with commit URL: {folder_upload_url}")
[INFO] Creating repo on Hugging Face Hub with name: trashify_demo_v3
[INFO] Full Hugging Face Hub repo name: mrdbourke/trashify_demo_v3
[INFO] Uploading demos/trashify_object_detector_data_aug_model_with_nms to repo: mrdbourke/trashify_demo_v3
[INFO] Demo folder successfully uploaded with commit URL: https://huggingface.co/spaces/mrdbourke/trashify_demo_v3/tree/main/.
23.2 TK - Embed the Space to Test the Model
from IPython.display import HTML
# You can get embeddable HTML code for your demo by clicking the "Embed" button on the demo page
HTML(data='''
<iframe
    src="https://mrdbourke-trashify-demo-v3.hf.space"
    frameborder="0"
    width="1000"
    height="1600"
></iframe>
''')
# UPTOHERE
# Next, focus on a single input -> output ✅
# Show case what an output from the model looks like untrained (e.g. plot the next boxes on it) ✅
# After showcasing 1x prediction, move onto training a model and seeing if we can get it to improve ✅
# Continually focus on 1 input, 1 output until we can scale up ✅
# Create a demo of our model and upload it to Hugging Face ✅
# Add examples to test the demo ✅
# Write code to upload the demo to Hugging Face ✅
# Create visualization of input and output of data augmentation ✅
# Create demo of model with data augmentation ✅
# Model 2: Try improve our model with data augmentation ✅
# Visualize data augmentation examples in and out of the model
# Note: looks like augmentation may hurt our results... 🤔, this is because our data is so similar, potentially could help with more diverse data, e.g. synthetic data
# Try in a demo and see how it works -> Trashify Demo V2 ✅
# Extension: Also try a model training for longer
# Model 3 (just improve with NMS): Create NMS option so only highest quality boxes are kept for each class ✅
# Next:
# Go through the notebook and clean it up
# Once we've got a better performing model, introduce evaluation metrics
# End: three models, three demos, one without data augmentation, one with it, one with NMS (post-processing) + can have as an extension to train the model for longer and see what happens
# Extensions:
# Train a model for longer and see if it improves (e.g. 72 epochs)
# Workflow:
# Untrained model -> input/output -> poor results (always visualize, visualize, visualize!)
# Trained model -> input/output -> better results (always visualize, visualize, visualize!)
# Outline:
# Single input/output with untrained model (bad output)
# Train model to improve on single input/output
# Introduce evaluation metric
# Introduce data augmentation, see D-FINE paper for data augmentation options (we can keep it simple)
# See: https://arxiv.org/pdf/2410.13842
# "The total batch size is 32 across all variants. Training schedules include 72 epochs with advanced augmentation (RandomPhotometricDistort, RandomZoomOut, RandomIoUCrop, and RMultiScaleInput)
# followed by 2 epochs without advanced augmentation for D-FINE-X and D-FINE-L, and 120 epochs with advanced augmentation followed by 4
# epochs without advanced augmentation for D-FINE-M and D-FINE-S (RT-DETRv2 Training Strategy (Lv et al., 2024) in Table 3)"
# TODO: Read RT-DETRv2 training strategy from paper mentioned above
# TODO: Read PP-YOLO data augmentation paper (keep it simple to begin with, can increase when needed)
# Create demo with Gradio
# Create demo with Transformers.js, see: https://huggingface.co/docs/transformers.js/en/tutorials/vanilla-js
24 Extensions + Extra-Curriculum
- Extension: possibly improve the model with synthetic data? e.g. for classes/bins not well represented in the current dataset
- Extension: train the model for longer and see how it improves, this could be model V4
- Baselines:
  - V1 = model with no data augmentation
  - V2 = model with data augmentation
  - V3 = model with NMS (post-processing)
- Extensions:
  - V4 = model trained for longer with NMS
  - V5 = synthetic data scaled up…?
- Extension: Zero-shot object detection - but what if I don’t have labels? (see the sketch after this list)
  - This could discuss the use of zero-shot object detection models such as GroundingDINO and OmDet
  - See OmDet - https://huggingface.co/omlab/omdet-turbo-swin-tiny-hf
  - See GroundingDINO - https://huggingface.co/docs/transformers/en/model_doc/grounding-dino
- Extension: Try to repeat the workflow we’ve gone through with another model such as https://huggingface.co/IDEA-Research/dab-detr-resnet-50-dc5-pat3 (apparently it is slightly better performing on COCO too)
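As a hedged sketch of what the zero-shot extension could look like (the checkpoint, example image path and candidate labels here are illustrative assumptions, not part of the Trashify training workflow), the `zero-shot-object-detection` pipeline can detect arbitrary text-prompted classes without any fine-tuning:

from transformers import pipeline
from PIL import Image

# A minimal zero-shot detection sketch (assumed example checkpoint, not used elsewhere in this project)
zero_shot_detector = pipeline(
    task="zero-shot-object-detection",
    model="google/owlvit-base-patch32"
)

image = Image.open("examples/trashify_example_1.jpeg")  # assumed example image path from the demo folder

# Candidate labels are free text, no fine-tuning or labelled boxes required
predictions = zero_shot_detector(image, candidate_labels=["trash", "rubbish bin", "hand"])

for pred in predictions:
    print(f"{pred['label']}: {round(pred['score'], 3)} -> {pred['box']}")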
25 Summary
- Bounding box formats: An important step in any object detection project is to figure out what format your bounding boxes are in (e.g. XYXY corner coordinates vs. normalized CXCYWH), see the sketch below for converting between common formats.
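As a minimal sketch (not from the original text, with an assumed image size purely for illustration), torchvision's built-in `box_convert` can translate between the formats that come up throughout this project:

import torch
from torchvision.ops import box_convert

# A box in absolute XYXY (corner) format: [x_min, y_min, x_max, y_max]
box_xyxy = torch.tensor([[10.0, 20.0, 110.0, 220.0]])

# Convert to CXCYWH (center_x, center_y, width, height), the format DETR-style models predict in
box_cxcywh = box_convert(boxes=box_xyxy, in_fmt="xyxy", out_fmt="cxcywh")
print(box_cxcywh)  # tensor([[ 60., 120., 100., 200.]])

# Normalize by the image size to get values in [0, 1] (assumed image size of 640x480 for illustration)
image_width, image_height = 640, 480
box_cxcywh_normalized = box_cxcywh / torch.tensor([image_width, image_height, image_width, image_height])
print(box_cxcywh_normalized)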
26 Extra resources
- A Guide to Bounding Box Formats and How to Draw Them by Daniel Bourke.