Glossary

This is messy and a work in progress. Will tidy up later.

Predictive vs generative AI = predictive AI -> machine-readable outputs, generative AI -> human-readable outputs. Predictive-style models take in data and map it to a fixed output space (e.g. a text classification model predicting whether an email is spam or not). Generative AI models take in data and generate an open-ended response (though theoretically this response is bounded by the training distribution), for example, a chat system taking in natural language instructions and producing natural language as output. Generative AI models can be turned into predictive-style models, for example, a generative LLM could produce JSON outputs if instructed/constrained to do so.
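As a minimal sketch of turning a generative output into a predictive-style one, the snippet below treats a model's raw text response as a JSON string and maps it onto a fixed label space. The response string, the ALLOWED_LABELS set and the parse_prediction helper are all hypothetical, and no real LLM is called.

```python
import json

# Fixed output space, as in a predictive-style model (labels are illustrative).
ALLOWED_LABELS = {"spam", "not_spam"}


def parse_prediction(raw_response: str) -> str:
    """Parse a JSON response from a generative model into one of the fixed labels."""
    data = json.loads(raw_response)  # raises json.JSONDecodeError on invalid JSON
    label = data.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return label


# Hypothetical text that an LLM might return when instructed to answer in JSON.
example_response = '{"label": "spam"}'
print(parse_prediction(example_response))
```

Rejecting anything outside the allowed labels is what makes the output space fixed again, even though the underlying model could have generated any text.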
Transformer = A deep learning model that adopts the attention mechanism to draw global dependencies between input and output
Tokenization = turning a series of data (text or image) into a series of tokens, where a token is a numerical representation of the input data, for example, in the case of text, tokenization could mean turning the words in a sentence into numbers (e.g. “hello world” -> [101, 102])
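A toy sketch of the idea, assuming a simple whitespace split and a vocabulary built on the fly. Real tokenizers (such as those in the tokenizers library) usually split text into word-pieces and use a much larger, pre-built vocabulary, so the IDs below are illustrative only. The build_vocab and tokenize helpers are hypothetical names.

```python
def build_vocab(texts):
    """Assign an integer ID to every unique whitespace-separated word."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def tokenize(text, vocab, unknown_id=-1):
    """Turn a string into a list of token IDs using the vocabulary."""
    return [vocab.get(word, unknown_id) for word in text.lower().split()]


vocab = build_vocab(["hello world", "hello there"])
print(vocab)                           # {'hello': 0, 'world': 1, 'there': 2}
print(tokenize("hello world", vocab))  # [0, 1]
```

Words that are not in the vocabulary map to unknown_id here. Word-piece tokenizers reduce this problem by breaking unseen words into smaller known pieces.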
Tokens = a token is a letter, word or word-piece that a model uses to represent input data, for example, in the case of text, a token could be a whole word (e.g. “hello”) or a word-piece (e.g. “hell” and “o”), see: https://platform.openai.com/tokenizer for an example
transformers = A Python library by Hugging Face that provides a wide range of pre-trained transformer models, fine-tuning tools, and utilities to use them
datasets = A Python library by Hugging Face that provides a wide range of datasets for NLP and CV tasks
tokenizers = A Python library by Hugging Face that provides a wide range of tokenizers for NLP tasks
evaluate = A Python library by Hugging Face with premade evaluation functions for various tasks
torch = PyTorch, an open-source machine learning library
transformers.pipeline = an abstraction to get a machine learning pipeline up and running in a few lines of code. It handles data preprocessing and device placement behind the scenes. For example, transformers.pipeline("text-classification") can be used to tokenize input text and classify it.
transfer learning = taking what one model has learned and applying it to another task (e.g. a model which has learned across many millions of words of text from the internet and then adjusting it to work with your smaller dataset)
fine-tuning = a type of transfer learning where you take the existing patterns of one model (usually trained on a very large dataset) and customize them to work for your smaller dataset
full fine-tuning = fine-tune all of a model's parameters to your dataset
partial fine-tuning = only fine-tune a portion of a model's parameters to your dataset
feature extraction fine-tuning = only fine-tune the final layer(s) of a model to your dataset (e.g. the majority of the backbone stays frozen)
LoRA (Low-Rank Adaptation) = train a small pair of low-rank adapter matrices (far fewer parameters than a full model) whose product is added to your base model weights (base model weights stay frozen)
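To make "far fewer parameters" concrete, here is a rough parameter count, assuming the usual LoRA formulation where a frozen weight matrix W of shape (d_out, d_in) is adapted by adding the product of two small matrices B (d_out, rank) and A (rank, d_in). The layer sizes and rank are illustrative, not taken from any particular model.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a full weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank pair B (d_out, rank) and A (rank, d_in)."""
    return d_out * rank + rank * d_in


d_out, d_in, rank = 768, 768, 8  # illustrative sizes
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)

print(full)                  # 589824
print(lora)                  # 12288
print(f"{lora / full:.2%}")  # 2.08%
```

Under these assumptions only about 2% as many parameters are trained for this layer, while the original 589,824 weights stay frozen.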
hyperparameters = values you can set to adjust training settings, for example, learning rate is a hyperparameter that is adjustable
Hugging Face Hub (or Hub for short) = Place to store datasets, models, and other resources of your own + find existing datasets, models & scripts others have shared. If you are familiar with GitHub, Hugging Face is like the GitHub of machine learning.
Auto Classes = A series of classes in transformers which enable automatic loading of preprocessor or model classes based on the name or path of the model. For example, you can load the processor for microsoft/conditional-detr-resnet-50 with transformers.AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50") or the model architecture with transformers.AutoModelForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50").
Hugging Face Spaces = A platform to share and run machine learning apps/demos. These can be built with HTML, Gradio or Streamlit
HF = Hugging Face
NLP = Natural Language Processing
CV = Computer Vision
Image classification = Classify an image into a single class or multiple classes (classifying something as multiple items or labels such as [warm, well lit, sunset] is also known as tagging or more specifically, image tagging). For example, is a photo of food or not food?
Object detection = Detect and locate an item in an image or series of images (e.g. a video). An item can be almost anything in an image, for example, a licence plate, a person, a weed in a garden or a small bug on the body of a bee.
Bounding box = A box, often rectangular in nature, drawn around an item in an image to indicate its location. Can come in several different forms such as XYXY, XYWH and CXCYWH (see more in A Guide to Bounding Box Formats and How to Draw Them).
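The three formats hold the same information and can be converted into each other. The sketch below assumes XYXY means (x_min, y_min, x_max, y_max), XYWH means (x_min, y_min, width, height) and CXCYWH means (center_x, center_y, width, height). Some datasets and libraries use these names slightly differently, so check the convention before converting. The function names are hypothetical.

```python
def xyxy_to_xywh(box):
    """(x_min, y_min, x_max, y_max) -> (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def xyxy_to_cxcywh(box):
    """(x_min, y_min, x_max, y_max) -> (center_x, center_y, width, height)."""
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)


def cxcywh_to_xyxy(box):
    """(center_x, center_y, width, height) -> (x_min, y_min, x_max, y_max)."""
    center_x, center_y, width, height = box
    return (center_x - width / 2, center_y - height / 2,
            center_x + width / 2, center_y + height / 2)


box = (10, 20, 50, 80)      # XYXY
print(xyxy_to_xywh(box))    # (10, 20, 40, 60)
print(xyxy_to_cxcywh(box))  # (30.0, 50.0, 40, 60)
```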
TPU = Tensor Processing Unit
GPU = Graphics Processing Unit
Learning rate = Often the most important hyperparameter to tune. It scales how much an optimizer changes a model’s parameters at every update step. A higher value means larger updates (though sometimes too large) and a lower value means smaller updates (though sometimes too small to make progress). The ideal learning rate has to be found experimentally. Common values include 0.001, 0.0005, 0.0001, 0.00005, 0.00001 (though the learning rate can be any value). Many optimizers have decent default learning rates. For example, the Adam optimizer (a common and generally well-performing optimizer) in PyTorch (torch.optim.Adam) has a default learning rate of 0.001. When fine-tuning an already trained model, a learning rate 10x smaller than the one used for the original training is a good rule of thumb (e.g. if a model was trained with a learning rate of 0.001, fine-tuning with 0.0001 is common). The learning rate does not have to be static and can change dynamically during training. This practice is referred to as learning rate scheduling.
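The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent (not Adam) on the loss f(w) = w**2, which has its minimum at w = 0. The loss function, starting value and step count are assumptions chosen only for illustration.

```python
def gradient_descent_step(w: float, learning_rate: float) -> float:
    """One update step on the toy loss f(w) = w**2, whose gradient is 2 * w."""
    gradient = 2 * w
    return w - learning_rate * gradient


def run(w: float, learning_rate: float, steps: int) -> float:
    """Apply several update steps and return the final value of w."""
    for _ in range(steps):
        w = gradient_descent_step(w, learning_rate)
    return w


start = 1.0
for lr in (0.001, 0.1, 1.5):
    print(lr, run(start, lr, steps=10))
```

With the smallest learning rate w barely moves from 1.0 in 10 steps, with 0.1 it gets close to the minimum at 0, and with 1.5 every step overshoots so w grows in magnitude instead of shrinking.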
Inference = using a trained (or untrained) model to make predictions on a given piece of data. The model infers what the output should be based on the inputs. Inference is often much faster than training on a per-sample basis because no weights get updated during inference. However, inference can often take more compute than training in the long run, because a model can be trained once but then used for inference millions of times (or more) over the next several months (or longer).
Prediction probability = the probability a model assigns to its prediction for a given input. It is a score between 0 and 1, with 1 being the highest. For example, a model may have a prediction probability of 0.95. This would mean it’s quite confident in its prediction but it doesn’t mean it’s correct. A good way to inspect potential issues with a dataset is to show examples in the test set which have a high prediction probability but are wrong (e.g. pred prob = 0.98 but the prediction was incorrect).
Hugging Face Pipeline (pipeline) = A high-level API for using models for various tasks (e.g. text-classification, audio-classification, image-classification, object-detection and more), see the docs: https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/pipelines#transformers.pipeline
loss value = Measure of how wrong your model is by a given metric (the loss function). A perfect model will have a loss value of 0 (it is able to predict the data perfectly), though this is highly unlikely in practice (there are no perfect models). Ideally, the loss value will go down (towards 0) as training goes on. If the loss value on the training set is much lower than the loss value on the test set, the model is likely overfitting (memorizing the training set too well rather than learning generalizable patterns to unseen data). To fix overfitting, introduce more regularization. To fix underfitting (loss not going down), introduce more learning capacity (more data, more parameters in the model, longer training). Machine learning is a constant battle between overfitting and underfitting.
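As one concrete example of a loss value, the sketch below computes the mean cross-entropy loss, assuming a classification setting where the loss for each sample is the negative log of the probability the model gave to the correct class. Cross-entropy is only one of many loss functions and the probabilities below are made up for illustration.

```python
import math


def mean_cross_entropy(correct_class_probs):
    """Average of -log(p), where p is the probability assigned to the correct class."""
    return sum(-math.log(p) for p in correct_class_probs) / len(correct_class_probs)


perfect = [1.0, 1.0, 1.0]    # the model is fully confident and correct every time
decent = [0.9, 0.8, 0.95]
poor = [0.3, 0.1, 0.5]

print(mean_cross_entropy(perfect))  # loss of zero for perfect predictions
print(mean_cross_entropy(decent))
print(mean_cross_entropy(poor))
```

The loss is zero only when every correct class gets probability 1, and it grows as the model puts less probability on the correct answers.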
Random seed = Value to flavour the randomness of an operation. For example, if you set a random seed to 42 the numbers produced by a random generator will be random but flavoured by the seed. This means if the seed stays at 42, subsequent calls of the same operation will return the same values. Not setting a random seed will result in different random values each time. Setting a random seed is done to ensure reproducibility of an operation. This is helpful when performing experiments and you do not want the outputs to be random each time.
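A small sketch using Python's built-in random module. The same idea applies to other libraries (for example, PyTorch has torch.manual_seed), but only the standard library is used here.

```python
import random

random.seed(42)
first = [random.randint(0, 100) for _ in range(5)]

random.seed(42)  # reset to the same seed
second = [random.randint(0, 100) for _ in range(5)]

random.seed(7)  # a different seed flavours the randomness differently
third = [random.randint(0, 100) for _ in range(5)]

print(first)
print(second)
print(third)
print(first == second)  # True: same seed, same sequence of values
```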
Synthetic data generation = using a model such as a generative Large Language Model (LLM) to generate synthetic pieces of data for a specific problem. For example, getting an LLM to generate food and not food image captions to create a binary text classification model. Synthetic data is very helpful when bootstrapping a machine learning problem. Though it is advised to use synthetic data only for training and to evaluate on real data whenever possible.
Pre-trained models = models which have already been trained on a large dataset, for example, text-based models which have gone through many millions of words of text (e.g. all of Wikipedia and 1000s of books) or image-based models which have seen millions of images (e.g. models trained on ImageNet). In essence, any model which has already spent a large amount of time learning patterns in data. These patterns can then be adjusted for your own specific problems, often with much smaller amounts of data for excellent results. The process of customizing a pre-trained model for a specific problem is called transfer learning (transferring what an existing model knows to your own problem).
Training/test split = One of the most important concepts in machine learning. Train models on the training data and evaluate them on the test data. The test data should never be seen by a model during training. Think of the test data as the final exam in a university course. A model should be able to learn enough patterns in the training set to perform well on the test set, just like a student should be able to learn enough from the course materials to do well on the final exam. If a model performs well on the training set but not well on the test set, this is known as overfitting, as in, the model memorizes the training set rather than learning generalizable patterns to unseen data. If a model performs poorly on both the training set and the test set, this is known as underfitting.
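A minimal sketch of making such a split with the standard library, assuming a simple random split is appropriate for the data. The split_train_test helper is a hypothetical name, and the 80/20 ratio and seed are common but arbitrary choices.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


samples = list(range(10))
train, test = split_train_test(samples)

print(len(train), len(test))        # 8 2
print(set(train).isdisjoint(test))  # True: no test sample appears in training
```

Keeping the two lists disjoint is the code-level version of "the test data should never be seen by a model during training".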
Prediction probabilities = a value assigned to a model’s prediction on a certain sample after its output logits have passed through an activation function such as Softmax or Sigmoid. For example, in a binary classification problem of whether an image is of food or not of food, a model could assign a prediction probability of the image being 0.98 food and 0.02 not food. Prediction probabilities do not indicate how right a prediction is, more so, how confident a model is in that prediction. The closer a prediction probability is to 1, the higher the model’s confidence in the prediction. A good evaluation step is to inspect samples with low prediction probabilities (the model seems to get confused on them) or inspect test samples where the model has a high prediction probability but the prediction is wrong (these predictions are often referred to as most wrong predictions).
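Finding the most wrong predictions can be sketched as a filter followed by a sort. The results list below is made-up data in a hypothetical (true label, predicted label, prediction probability) format.

```python
# Hypothetical test-set results: (true label, predicted label, prediction probability).
results = [
    ("food", "food", 0.99),
    ("not_food", "food", 0.98),
    ("food", "not_food", 0.61),
    ("not_food", "not_food", 0.87),
    ("food", "not_food", 0.93),
]

# Keep only the incorrect predictions.
wrong = [result for result in results if result[0] != result[1]]

# Sort so the most confident mistakes come first.
most_wrong = sorted(wrong, key=lambda result: result[2], reverse=True)

for true_label, pred_label, pred_prob in most_wrong:
    print(f"true={true_label} pred={pred_label} prob={pred_prob}")
```

The samples at the top of this list are the ones worth inspecting first, since a confident mistake often points to a mislabelled or ambiguous example.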
Logits = the raw outputs of a model
Softmax function = an activation function which can be applied to logits to get prediction probabilities.
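A plain-Python sketch of going from logits to prediction probabilities. The logit values are made up, and the sigmoid helper is included because it plays the same role for a single-logit binary output. In practice a library implementation would normally be used instead.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    max_logit = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Squash a single logit into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-logit))


logits = [2.0, 0.5, -1.0]  # illustrative raw model outputs for three classes
probs = softmax(logits)

print(probs)
print(sum(probs))    # approximately 1
print(sigmoid(0.0))  # 0.5
```

The largest logit always gets the largest probability, so softmax changes the scale of the outputs but not their ranking.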