Term | Definition | Example/Notes |
---|---|---|
Architecture | This is the skeleton of the model — the definition of each layer and each operation that happens within the model. | For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.” See the architecture-vs-checkpoint sketch after this table. |
Attention layers | This layer will tell the transformer model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word. | |
Auto-encoding transformer models | Also known as encoder models | e.g., BERT like |
Auto-regressive (AR) model | An auto-regressive model is a type of model used in time series analysis and natural language processing that predicts future values based on past values. Breakdown of the term: - Auto: Refers to “self” or “same” (self-referential). - Regressive: Refers to regression, which is a statistical method for predicting a value. Key characteristics in the context of time series analysis: - An AR model predicts future values in a time series using a linear combination of past values. - The model assumes that the current value of the series is a function of previous values. Key characteristics in the context of language models: 1. Predicts one word (or token) at a time. 2. Uses previous/past outputs as inputs for future predictions. - This process continues iteratively until the model generates the desired length of text or reaches a stopping criterion. It’s designed for unidirectional, left-to-right processing. | Visual example:<br>Input: "The cat is"<br>Step 1: Predict "on" -> "The cat is on"<br>Step 2: Predict "the" -> "The cat is on the"<br>Step 3: Predict "mat" -> "The cat is on the mat"<br>See the greedy-decoding sketch after this table. |
Auto-regressive transformer models | A broad category of models that generate sequences step-by-step, predicting the next element/token based on previous elements only. The model’s output at each step depends only on past outputs and not on future tokens. Transformer decoder models are a specific type of model within the Transformer architecture. When used in an auto-regressive manner (like in GPT models), they generate text one token at a time, predicting the next token based on previously generated tokens. So, Transformer decoders act as a subtype of auto-regressive models in this context (e.g., text generation). A standalone Transformer decoder can work as an auto-regressive model, which is how it’s used in models like GPT, where there’s no need for an encoder because the task is purely generative (predicting text). | Besides Transformer decoders (e.g., the GPT family of language models), this category also includes other architectures such as RNNs or LSTMs when they’re used in a sequential, generative manner. |
Bi-directional attention | Bi-directional attention, a characteristic of encoder models, is the result of applying self-attention in both the forward (right/succeeding/future) and backward (left/preceding/past) directions, enabling each word to be influenced by all other words in the sentence. This is in contrast to uni-directional attention, where each word only considers context in the backward (left/preceding/past) direction. | In the phrase “It was a bright sunny day,” the word “sunny” can attend to “bright” and “day” due to self-attention, achieving a bi-directional context that enriches its meaning. This bi-directional influence is essential for accuracy in tasks like named entity recognition and extractive question answering. |
Bidirectional Encoder Representations from Transformers (BERT) | ||
Byte Pair Encoding (BPE) | ||
Causal Effect ^casual-effect | Refers to the direct impact that one variable (e.g., surge pricing) has on another variable (e.g., Book Through Rate), isolating this relationship from other influencing factors (e.g., time of day, weather, seasonal trends, etc.). A model is causally correct if it accurately captures and reflects the true cause-and-effect relationship between variables, ensuring that the observed effects are genuinely due to the changes in the variable of interest and not confounded by other factors. | |
Causal Language Modelling (CLM) | The output depends on the past and present inputs, but not the future ones. Causal Language Modelling (CLM) is a technique that focuses on predicting the next word in a sequence based on previous words, which is often used in Natural Language Generation (NLG) tasks. While CLM is a method within NLG, NLG encompasses a broader range of tasks, including generating coherent text, summaries, and more from given prompts. So, while they are related, they are not synonymous; CLM is a specific approach used for certain types of NLG applications. | e.g., Predicting the next word in a sentence having read the n previous words |
Chain-of-Thought (CoT) Prompting | A prompting technique that enables complex reasoning capabilities through intermediate reasoning steps | |
Checkpoints | These are the weights that will be loaded in a given architecture. | For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.” |
Content Generation | Crafting creative or informative text based on prompts. | GPT-4 - Generates creative content and stories. |
Cross-attention | In some architectures, decoders can utilize cross attention to incorporate information from encoder outputs, allowing the model to consider both the generated sequence and the context provided by an encoder. Cross attention is a mechanism that applies specifically to encoder-decoder models, such as those used in translation tasks, where the decoder needs to reference the encoder’s outputs for context. In contrast, encoder-only models (like BERT) do not utilize cross attention, as they only process the input sequence without generating new output based on preceding context. Encoder-only models rely on self-attention mechanisms to create embeddings for the input data. | |
Data-to-Text Generation | Translating structured data (e.g., tables or graphs) into natural language. | T5 or Data2Text Transformers - Converts structured data into narrative text. |
Decoder Models | A specific type of model within the Transformer architecture. The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs. | e.g., CTRL, GPT, GPT-2, Transformer XL. When used in an auto-regressive manner (like in GPT models), they generate text one token at a time, predicting the next token based on previously generated tokens. Decoder-only models are good for generative tasks such as: - Text generation |
Decorators | A powerful and useful tool in Python that allows you to modify the behavior of a function or a class. In other words, decorators wrap functions to extend their behavior. Common uses: - Timing/logging - Authentication - Caching - Input validation | See the timing-decorator sketch after this table. |
Dialogue Generation | Creating responses for chatbots and conversational agents. | DialoGPT - Generates conversational responses. |
DistilBERT | ||
ELECTRA | Pre-training Text Encoders as Discriminators Rather Than Generators. ELECTRA is a new pretraining approach which trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens were replaced by the generator in the sequence. In other words, we train the model to detect whether words in a sentence are real or replaced, adding robustness to the model’s understanding of language. | |
Embeddings | Process of converting text or other non-numerical data into a numerical vector representation. | |
Emergent Abilities | Ability to perform tasks the model was not explicitly trained to perform. | e.g., Though the GPT models were not specifically trained to perform translation tasks (unlike the original Transformer) and were primarily trained on next-word prediction, translation ability emerges as a natural consequence of the model’s exposure to vast quantities of multilingual data in diverse contexts. |
Encoder Models | Encoders are very powerful at extracting vectors that carry meaningful information about the sequence. These models are often characterized as having “bi-directional” attention because they consider both the preceding and succeeding context of each word in a sentence. This allows the model to fully grasp context by examining relationships in both directions, enhancing comprehension of ambiguous or polysemous terms. The self-attention mechanism helps extract feature vectors by weighing each word’s relevance to others in the sequence, thus capturing dependencies. This approach highlights contextually important words for each token, creating richer, more informative representations. Encoder models are also called auto-encoding models. Encoder models typically form the first half of the Transformer architecture, designed to “encode” the entire input sequence into contextualised embeddings. The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input. | e.g., ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa. Encoder-only models are good for tasks that require understanding of the inputs, and the relationship between them, such as: - Sentence classification - Named Entity Recognition - Extractive question answering |
Encoder-Decoder Models | Models that use both parts of the Transformer architecture: the encoder builds a representation of the input, and the decoder uses that representation to generate the target sequence. Also known as sequence-to-sequence models. | e.g., BART, T5, Marian, mBART, ProphetNet, mT5, Pegasus, M2M100. Encoder-decoder models are good for generative tasks that require an input, such as: - Summarization - Translation - Generative question answering |
Fine-tuning | A technique within transfer learning. Training is done after a model has been pretrained. | To perform fine-tuning, we first acquire a pretrained language model, then perform additional training with a dataset specific to our task. Fine-tuning requires a smaller amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning |
FLOPS | Floating Point Operations Per Second. A standard measure of a computer’s or GPU’s computational performance, especially for tasks that require high numerical precision, such as scientific calculations, AI training, or gaming graphics. | FLOPS can be measured in various scales: • GigaFLOPS (GFLOPS): Billion (10⁹) operations per second. • TeraFLOPS (TFLOPS): Trillion (10¹²) operations per second. • PetaFLOPS (PFLOPS): Quadrillion (10¹⁵) operations per second. |
Foundation model | The first training stage of an LLM is also known as pre-training, creating an initial pretrained LLM, often called a base or foundation model. | Examples include the GPT-3 model (the precursor of the original model offered in ChatGPT) |
Function Calling | Name given to an LLM’s capability of forming a string containing a function call, or the structure needed to make a function call. ‘Tools’ are the functions being called. | |
General Purpose LLMs | Respond to all types of queries, including function calling. This is in contrast to Special Purpose LLMs. | OpenAI GPT-x, Gemini, Mistral, etc. |
Generative Pre-trained Transformer (GPT) | ||
Intent Recognition | Identifying user intent in commands or queries. | Dialogflow Intent Recognizer - Used in virtual assistants for identifying user intentions. |
Masked AutoEncoder (MAE) | ||
Masked Language Modelling (MLM) | Model predicts a masked (missing or hidden) word in the sentence based on the surrounding context. | |
Masked Self-Attention | This mechanism allows the decoder model to focus on previous words while generating the next word. In masked self-attention, the model is restricted from accessing future words in the sequence by applying a mask, ensuring that the predictions depend only on past and current words. | See the causal-mask attention sketch after this table. |
Model | This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. | For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.” |
Model Serving | ||
Named Entity Recognition (NER) | Identifying and classifying entities like names, places, and dates. | spaCy’s Named Entity Recognizer - Identifies entities like names, organizations, and dates. |
Natural Language Generation (NLG) | ||
Natural Language Processing (NLP) | A field of AI focused on enabling computers to understand, interpret, and generate human language. | |
Natural Language Understanding (NLU) | ||
OpenAPI Specification | The OpenAPI specification is a standard, language-agnostic format for describing RESTful APIs. It uses JSON or YAML to define the structure, capabilities, and requirements of an API, enabling both humans and machines to understand and interact with it effectively. | |
Out of Vocabulary (OOV) | A term used in natural language processing (NLP) to describe words or tokens that are not part of the vocabulary that a model was trained on. | Models like Word2Vec or GloVe rely on a fixed vocabulary built during training. If a word appears in new data that wasn’t seen during training (a.k.a. OOV), the model doesn’t have a pre-trained embedding for it. This typically occurs with: • Rare words • Misspellings • New jargon or terminology introduced after the model was trained. |
Paraphrasing | Restating content in a different way. | PEGASUS - Designed for text rewriting and paraphrasing. |
Pre-training | Act of training a model from scratch; the weights are randomly initialized, and the training starts without any prior knowledge. | |
Principal Component Analysis (PCA) | A technique to reduce the dimensionality of data by projecting it onto the directions of greatest variance (the principal components). | Often used before clustering to reduce noise. See the scikit-learn sketch after this table. |
Prompting Techniques | ||
Pseudo-Labels | Automatically generated labels from the data itself. These labels are not manually annotated but are inferred based on the inherent structure or attributes of the data. In other words, the model automatically generates the labelled data. Examples in practice: 1. NLP: BERT predicts masked words (MLM), GPT predicts the next word in a sequence (CLM) 2. Vision: SimCLR, MAE 3. Speech: Wav2Vec | Example of a masking/prediction task: In NLP, when training a model like BERT, random words in a sentence are masked. The model is trained to predict these masked words using the context (MLM). • Input: “The cat is ___ the table.” • Pseudo-label: “on.” See the fill-mask sketch after this table. |
Pydantic | Pydantic is the most widely used data validation library for Python. For more details, refer to the official website. | See the validation sketch after this table. |
RESTful API | ||
Recurrent Neural Network (RNN) | ||
RoBERTa | ||
Self-attention | Self-attention is the mechanism that lets each word in a sentence focus on every other word, assessing their relationships to form a context-aware representation. In other words, the self-attention mechanism enables the model to weigh the importance of each word in a sequence relative to all other words, thus allowing the model to capture long-range contextual relationships and dependencies. Both encoders and decoders consist of many layers connected by the self-attention mechanism. | In the phrase “It was a bright sunny day,” the word “bright” can attend to “sunny” and “day” due to self-attention, achieving a bi-directional context that enriches its meaning. This bi-directional influence is essential for accuracy in tasks like named entity recognition and extractive question answering. See the NumPy attention sketch after this table. |
Self-supervised learning | A type of learning in which the objective is automatically computed from the inputs of the model. In other words, a type of machine learning where models learn to predict parts of data from other parts, without labels. That means humans are not needed to label the data. | Often used in natural language processing (NLP) and computer vision to pre-train models on large datasets. Transformer models like GPT, BERT, BART, T5, etc. have been trained as language models on large amounts of raw data in a self-supervised fashion. This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. |
Semantic Parsing | Converting language into structured data, often for databases or code. | T5 - Converts language into structured data (e.g., SQL queries) |
Sentiment Analysis | Detecting the emotional tone of text. | BERT-based Sentiment Classifier - Analyzes text for sentiment polarity. |
Sequence-to-sequence transformer models | Also known as encoder-decoder models | e.g., BART/T5 like |
Special Purpose LLMs | Highly trained to focus on a single or small set of tasks. This is in contrast to General Purpose LLMs | Raven-13B - Tuned to provide function calling services - Smaller & lower latency than general purpose LLMs |
Summarization | Producing concise summaries of longer content | BART - Generates concise summaries for long texts. |
Text Classification | Categorizing text into predefined labels. | |
Text Completion | Generating the continuation of a given text. | GPT-3 - Completes text based on context. |
Transfer learning | A technique where a model developed for one task is reused as the starting point for a model on a second, related task. In other words, a process in which a (general) pretrained model is fine-tuned in a supervised way - that is, using human-annotated labels - on a given task. The principle of transfer learning is to leverage knowledge from one domain (source task) to improve learning in another (target task). | Often used in NLP and computer vision, where pre-trained models (e.g., BERT, ResNet) are fine-tuned for new tasks. |
Transformer Model | A deep neural network (DNN) architecture primarily used for natural language processing (NLP) tasks. Transformer models rely on attention mechanisms to process data, enabling parallelization and improved performance on sequential data. | Widely used in NLP tasks like translation, text generation, and question-answering; includes models like BERT and GPT. |
Uni-directional attention | Decoder models operate in a uni-directional manner, meaning they only consider the context from the left (previous words) when making predictions. They do not access future words, in contrast to the bidirectional attention used in encoder models. | |
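
Architecture vs. checkpoint: a minimal sketch, assuming the Hugging Face transformers library is installed and the Hub is reachable, contrasting a randomly initialized BERT architecture with the pretrained bert-base-cased checkpoint.

```python
# Minimal sketch; assumes the Hugging Face `transformers` library and Hub access.
from transformers import AutoConfig, AutoModel

# Architecture only: BERT's layers and operations, with randomly initialized weights.
config = AutoConfig.from_pretrained("bert-base-cased")
untrained_bert = AutoModel.from_config(config)

# Checkpoint: the same architecture, but loaded with the pretrained bert-base-cased weights.
pretrained_bert = AutoModel.from_pretrained("bert-base-cased")
```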
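
Auto-regressive generation: a greedy-decoding sketch of the “predict one token, append it, repeat” loop from the Auto-regressive (AR) model entry, assuming the transformers library, PyTorch, and the gpt2 checkpoint are available.

```python
# Minimal sketch; assumes `transformers`, PyTorch, and the `gpt2` checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat is", return_tensors="pt").input_ids
for _ in range(3):                                 # generate three tokens, one at a time
    logits = model(input_ids).logits               # scores over the vocabulary at each position
    next_id = logits[:, -1, :].argmax(dim=-1)      # greedy pick for the next token
    input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)  # past output becomes input

print(tokenizer.decode(input_ids[0]))
```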
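
Decorators: a small timing decorator, showing how a decorator wraps a function to extend its behavior (timing/logging being one of the common uses listed above).

```python
import functools
import time

def timed(func):
    """Decorator that reports how long the wrapped function takes to run."""
    @functools.wraps(func)          # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f} s")
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(1_000_000)                 # prints something like "slow_sum took 0.02xx s"
```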
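
Self-attention and masked self-attention: a NumPy sketch of scaled dot-product attention over a toy sequence. The optional causal mask hides future positions, which is the masked (uni-directional) variant used by decoder models. For simplicity, the queries, keys, and values are the inputs themselves rather than learned projections.

```python
import numpy as np

def self_attention(x, causal=False):
    """Scaled dot-product self-attention over x of shape (seq_len, d)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # relevance of every token to every other token
    if causal:                                      # masked self-attention: block future positions
        mask = np.tril(np.ones_like(scores))        # lower-triangular matrix of ones
        scores = np.where(mask == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: attention weights
    return weights @ x                              # context-aware representation of each token

x = np.random.randn(5, 8)                           # 5 tokens with 8-dimensional embeddings
bidirectional = self_attention(x)                   # encoder-style: each token sees all tokens
unidirectional = self_attention(x, causal=True)     # decoder-style: no access to future tokens
```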
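
Principal Component Analysis: a short sketch, assuming scikit-learn and NumPy are installed, reducing 10-dimensional data to 2 principal components, e.g. as a noise-reduction step before clustering.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 10)            # 100 samples with 10 features
pca = PCA(n_components=2)               # keep the two directions of highest variance
X_reduced = pca.fit_transform(X)        # shape (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)    # fraction of variance captured by each component
```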
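
Masked language modelling and pseudo-labels: a fill-mask sketch, assuming the transformers library and the bert-base-uncased checkpoint are available, applied to the “The cat is ___ the table” example above; the masked word serves as the pseudo-label the model must recover from context.

```python
# Minimal sketch; assumes `transformers` and the `bert-base-uncased` checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The hidden word acts as the pseudo-label: no human annotation is needed.
for prediction in fill_mask("The cat is [MASK] the table."):
    print(prediction["token_str"], round(prediction["score"], 3))   # e.g. "on", "under", ...
```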
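
Pydantic: a minimal validation sketch, assuming Pydantic v2 is installed; well-formed input is coerced into the declared types, while invalid input raises a ValidationError.

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int                               # "30" is coerced to 30; "thirty" is rejected

user = User(name="Ada", age="30")          # validation + type coercion
print(user.age, type(user.age))            # 30 <class 'int'>

try:
    User(name="Bob", age="thirty")         # invalid: cannot be parsed as an integer
except ValidationError as exc:
    print(exc)
```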