Glossary
This is messy and a work in progress. Will tidy up later.
- Predictive vs generative AI = predictive AI -> machine-readable outputs, generative AI -> human-readable outputs. Predictive-style models take in data and map it to a fixed output space (e.g. a text classification model predicting whether an email is spam or not). Generative AI models take in data and generate an unbounded response (though theoretically this response is bounded by the training distribution), such as a chat system taking in natural language instructions and producing natural language as output. Generative AI models can be turned into predictive-style models, for example, a generative LLM could produce JSON outputs if instructed/constrained to do so.
- Transformer = A deep learning model that adopts the attention mechanism to draw global dependencies between input and output
- Tokenization = turning a series of data (text or image) into a series of tokens, where a token is a numerical representation of the input data, for example, in the case of text, tokenization could mean turning the words in a sentence into numbers (e.g. “hello world” -> [101, 102])
- Tokens = a token is a letter, word or word-piece (part of a word) that a model uses to represent input data. For example, in the case of text, a token could be a word (e.g. “hello”) or a word-piece (e.g. “hell” and “o”), see: https://platform.openai.com/tokenizer for an example
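A minimal sketch of the tokenization and tokens entries above, assuming a toy word-level vocabulary. The helper names (`build_vocab`, `tokenize`, `detokenize`) and the `[UNK]` placeholder are hypothetical and only for illustration. Real tokenizers usually learn a sub-word vocabulary from data, so their IDs will differ from the ones here.

```python
# Toy word-level tokenizer: a sketch only, not how production tokenizers work.

def build_vocab(texts):
    """Assign an integer ID to every unique lowercase word, in first-seen order."""
    vocab = {"[UNK]": 0}  # hypothetical ID reserved for words not in the vocabulary
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Turn text into a list of token IDs, using the [UNK] ID for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def detokenize(token_ids, vocab):
    """Map token IDs back to words and join them with spaces."""
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    return " ".join(id_to_word[token_id] for token_id in token_ids)

vocab = build_vocab(["hello world", "hello there"])
token_ids = tokenize("hello world", vocab)

print(token_ids)                      # [1, 2]
print(tokenize("hello moon", vocab))  # [1, 0] ("moon" is unknown)
print(detokenize(token_ids, vocab))   # hello world
```

The model itself only ever sees the integer IDs, which is why the same tokenizer has to be used at training time and at inference time.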
- `transformers` = A Python library by Hugging Face that provides a wide range of pre-trained transformer models, fine-tuning tools, and utilities to use them
- `datasets` = A Python library by Hugging Face that provides a wide range of datasets for NLP and CV tasks
- `tokenizers` = A Python library by Hugging Face that provides a wide range of tokenizers for NLP tasks
- `evaluate` = A Python library by Hugging Face with premade evaluation functions for various tasks
- `torch` = PyTorch, an open-source machine learning library
- `transformers.pipeline` = an abstraction to get a machine learning pipeline up and running in a few lines of code, handles data preprocessing and device placement behind the scenes. For example, `transformers.pipeline("text-classification")` can be used to tokenize input text and classify it.
- transfer learning = taking what one model has learned and applying it to another task (e.g. taking a model which has learned from many millions of words of text from the internet and then adjusting it to work with your smaller dataset)
- fine-tuning = a type of transfer learning where you take the existing patterns of one model (usually trained on a very large dataset) and customize them to work for your smaller dataset
- hyperparameters = values you can set to adjust training settings, for example, learning rate is a hyperparameter that is adjustable
- Hugging Face Hub (or Hub for short) = Place to store datasets, models, and other resources of your own + find existing datasets, models & scripts others have shared. If you are familiar with GitHub, Hugging Face is like the GitHub of machine learning.
- Hugging Face Spaces = A platform to share and run machine learning apps/demos, usually built with Gradio or Streamlit
- HF = Hugging Face
- NLP = Natural Language Processing
- CV = Computer Vision
- TPU = Tensor Processing Unit
- GPU = Graphics Processing Unit
- Learning rate = Often the most important hyperparameter to tune. It is proportional to the amount an optimizer will update a model’s parameters at every update step. A higher value means larger updates (though sometimes too large), a lower value means smaller updates (though sometimes not enough). The ideal learning rate is found by experiment. Common values include 0.001, 0.0001, 0.0005, 0.00001, 0.00005 (though the learning rate can be any value). Many optimizers have decent default learning rates. For example, the Adam optimizer (a common and generally well-performing optimizer) in PyTorch (`torch.optim.Adam`) has a default learning rate of 0.001. For fine-tuning an already trained model, a learning rate 10x smaller than the default is a good rule of thumb (e.g. if a model was trained with a learning rate of 0.001, fine-tuning with 0.0001 is common). The learning rate does not have to be static and can change dynamically during training; this practice is referred to as learning rate scheduling.
- Inference = using a trained (or untrained) model to make predictions on a given piece of data. The model infers what the output should be based on the inputs. Inference is often much faster than training on a per-sample basis because no gradients are computed and no weights get updated during inference. However, inference can often take more total compute than training in the long run, because a model can be trained once but then used for inference millions of times (or more) over the next several months (or longer).
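As a rough illustration of the learning rate entry above, the sketch below runs plain gradient descent on a toy function, f(w) = (w - 3)^2, with three different learning rates. The function, the step count and the helper name `gradient_descent` are assumptions chosen for the example, not anything from a specific library.

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)            # derivative of (w - 3)^2 with respect to w
        w = w - learning_rate * grad  # update size is proportional to the learning rate
    return w

# Too small, reasonable, and too large for this toy problem
results = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}

for lr, w in results.items():
    print(f"lr={lr}: final w={w:.3f}, distance from minimum={abs(w - 3):.3f}")
```

On this toy problem the smallest rate barely moves towards the minimum at w = 3 in 20 steps, the middle rate gets close, and the largest rate overshoots further on every step and diverges. Real models behave less cleanly, which is why the learning rate is usually tuned by experiment.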
- Prediction probability = the probability a model assigns to its prediction for a given input. It is a score between 0 and 1, with 1 being the highest. For example, a prediction probability of 0.95 means the model is quite confident in its prediction, but it doesn’t mean the prediction is correct. A good way to inspect potential issues with a dataset is to show examples in the test set which have a high prediction probability but are wrong (e.g. pred prob = 0.98 but the prediction was incorrect).
- Hugging Face Pipeline (`pipeline`) = A high-level API for using models for various tasks (e.g. `text-classification`, `audio-classification`, `image-classification`, `object-detection` and more), see the docs: https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/pipelines#transformers.pipeline
- loss value = Measure of how wrong your model is by a given metric. A perfect model will have a loss value of 0 (it is able to predict the data perfectly), though this is highly unlikely in practice (there are no perfect models). Ideally, the loss value will go down (towards 0) as training goes on. If the loss value on the training set is much lower than the loss value on the test set, the model is likely overfitting (memorizing the training set too well rather than learning generalizable patterns that carry over to unseen data). To fix overfitting, introduce more regularization. To fix underfitting (loss not going down), introduce more learning capacity (more data, more parameters in the model, longer training). Machine learning is a constant battle between overfitting and underfitting.
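To make the loss value entry above concrete, here is a small sketch using binary cross-entropy, one common loss for two-class problems. The choice of loss, the helper name `binary_cross_entropy` and the example targets and probabilities are all assumptions for illustration.

```python
import math

def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1 - eps)  # clamp so log(0) is never evaluated
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

targets = [1, 0, 1, 1]  # hypothetical ground-truth labels

perfect = binary_cross_entropy(targets, [1.0, 0.0, 1.0, 1.0])
good = binary_cross_entropy(targets, [0.9, 0.1, 0.8, 0.95])
bad = binary_cross_entropy(targets, [0.4, 0.7, 0.3, 0.5])

print(f"perfect={perfect:.4f}, good={good:.4f}, bad={bad:.4f}")
```

The closer the predicted probabilities are to the targets, the closer the loss gets to 0. Computing the same value on the training set and on the test set, and comparing the two over time, is one simple way to spot the overfitting pattern described above.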
- Random seed = Value to flavour the randomness of an operation. For example, if you set a random seed to `42`, the numbers produced by a random generator will be random but flavoured by the seed. This means if the seed stays at `42`, subsequent runs of the same operation will return the same values. Not setting a random seed will result in different random values each time. Setting a random seed is done to ensure reproducibility of an operation. This is helpful when performing experiments and you do not want the outputs to be random each time.
- Synthetic data generation = using a model such as a generative Large Language Model (LLM) to generate synthetic pieces of data for a specific problem. For example, getting an LLM to generate food and not food image captions to create a binary text classification model. Synthetic data is very helpful when bootstrapping a machine learning problem. However, it is advisable to use synthetic data only for training and to evaluate on real data whenever possible.
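The random seed entry above can be sketched with Python's built-in `random` module. The helper name `sample_numbers` is hypothetical. Libraries such as PyTorch have their own seeding functions, which this example does not cover.

```python
import random

def sample_numbers(seed=None, count=3):
    """Draw `count` random integers; pass a seed to make the draw repeatable."""
    rng = random.Random(seed)  # a separate generator, so global random state is untouched
    return [rng.randint(0, 100) for _ in range(count)]

first = sample_numbers(seed=42)
second = sample_numbers(seed=42)
unseeded = sample_numbers()  # seeded from the operating system, differs between runs

print(first == second)  # True: same seed, same values
print(first, unseeded)
```

Creating a fresh generator with the same seed replays the same sequence, which is what makes an experiment reproducible.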
- Pre-trained models = models which have already been trained on a large dataset, for example, text-based models which have gone through many millions of words of text (e.g. all of Wikipedia and 1000s of books) or image-based models which have seen millions of images (e.g. models trained on ImageNet). In essence, any model which has already spent a large amount of time learning patterns in data. These patterns can then be adjusted for your own specific problems, often with much smaller amounts of data, for excellent results. The process of customizing a pre-trained model for a specific problem is called transfer learning (transferring what an existing model knows to your own problem).
- Training/test split = One of the most important concepts in machine learning. Train models on the training data and evaluate them on the test data. The test data should never be seen by a model during training. Think of the test data as the final exam in a university course. A model should be able to learn enough patterns in the training set to perform well on the test set, just like a student should be able to learn enough from the course materials to do well on the final exam. If a model performs well on the training set but not well on the test set, this is known as overfitting, as in, the model memorizes the training set rather than learning generalizable patterns that carry over to unseen data. If a model performs poorly on both the training set and the test set, this is known as underfitting.
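A minimal sketch of a training/test split as described above, using only the standard library. The helper name `train_test_split`, the 80/20 ratio and the placeholder samples are assumptions for illustration.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)               # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = [f"sample_{i}" for i in range(10)]  # hypothetical placeholder data
train, test = train_test_split(samples)

print(len(train), len(test))       # 8 2
print(set(train).isdisjoint(test)) # True: no sample appears in both sets
```

The important property is the last one: every sample ends up in exactly one of the two sets, so the model is never evaluated on data it saw during training.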
- Prediction probabilities = a value assigned to a model’s prediction on a certain sample after its output logits have passed through an activation function such as Softmax or Sigmoid. For example, in a binary classification problem of whether an image is of food or not of food, a model could assign prediction probabilities of 0.98 for food and 0.02 for not food. Prediction probabilities do not indicate how right a prediction is, but rather how confident a model is in that prediction. The closer a prediction probability is to 1, the higher the model’s confidence in the prediction. A good evaluation step is to inspect samples with low prediction probabilities (the model seems to get confused on them) or inspect test samples where the model has a high prediction probability but the prediction is wrong (these predictions are often referred to as most wrong predictions).
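The sketch below illustrates the prediction probabilities entry above: logits are passed through a hand-written softmax, and wrong predictions are then sorted by confidence to surface the "most wrong" ones. The logits, labels and variable names are hypothetical stand-ins for a real model's test-set outputs.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["food", "not_food"]

# Hypothetical (logits, true label) pairs
outputs = [
    ([4.0, 0.1], "food"),       # confident and correct
    ([3.5, -0.5], "not_food"),  # confident but wrong
    ([0.2, 0.3], "not_food"),   # correct but unsure
    ([-1.0, 2.0], "food"),      # wrong
]

most_wrong = []
for logits, true_label in outputs:
    probs = softmax(logits)
    pred_prob = max(probs)
    pred_label = labels[probs.index(pred_prob)]
    if pred_label != true_label:
        most_wrong.append((pred_prob, pred_label, true_label))

most_wrong.sort(reverse=True)  # highest confidence first

for pred_prob, pred_label, true_label in most_wrong:
    print(f"pred={pred_label} ({pred_prob:.3f}) but true label is {true_label}")
```

Samples near the top of this list are worth checking by hand, since a confidently wrong prediction can point to a mislabelled sample as well as to a model weakness.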