Term | Definition | Example/Notes
Activation functions: Functions that enable a neural network (NN) to learn non-linear relationships between features and the label.

For more details, refer to the Google Developers ML course and the Keras activations documentation.
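A minimal NumPy sketch of two common activation functions (ReLU and sigmoid), just to make the non-linearity concrete:

```python
import numpy as np

def relu(x):
    # ReLU passes positive values through and zeroes out negatives,
    # which is what introduces the non-linearity.
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid squashes any real value into the range (0, 1).
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))     # [0.  0.  0.  1.5]
print(sigmoid(x))  # values strictly between 0 and 1
```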
Architecture: The skeleton of the model, i.e., the definition of each layer and each operation that happens within the model.
For example, BERT is an architecture, while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” as well as “the bert-base-cased model.”
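A short sketch of the distinction in code, assuming the Hugging Face transformers library (with a PyTorch backend) is installed: AutoModel resolves the BERT architecture, and the bert-base-cased checkpoint supplies the trained weights.

```python
from transformers import AutoModel, AutoTokenizer

# BERT is the architecture; "bert-base-cased" is a checkpoint:
# that architecture plus the weights trained by Google.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```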
Pseudo-Labels: Automatically generated labels from the data itself. These labels are not manually annotated but are inferred based on the inherent structure or attributes of the data.

In other words, the model automatically generates the labelled data.

Examples in practice:
1. NLP: BERT predicts masked words in a sentence (MLM); GPT predicts the next word in a sequence (CLM)
2. Vision: SimCLR, MAE
3. Speech: Wav2Vec
Example of a Masking/Prediction Task:
In NLP, when training a model like BERT, random words in a sentence are masked. The model is trained to predict these masked words using the context (MLM).
• Input: “The cat is ___ the table.”
• Pseudo-label: “on.”
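A small sketch of the masked-word example above, assuming the Hugging Face transformers library with a PyTorch backend (BERT’s actual mask token is [MASK]); the pseudo-label “on” comes from the original sentence itself, not from a human annotator.

```python
from transformers import pipeline

# The masked word is hidden from the model; the pseudo-label ("on")
# is recovered from the original sentence, with no human annotation.
fill_mask = pipeline("fill-mask", model="bert-base-cased")
for prediction in fill_mask("The cat is [MASK] the table.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```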
Pydantic: Pydantic is the most widely used data validation library for Python. For more details, refer to the official website.
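A minimal sketch of Pydantic’s data validation (Pydantic v2 style field declarations):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str
    email: str

# Pydantic coerces "42" to an int and validates every field's type.
user = User(id="42", name="Ada", email="ada@example.com")
print(user.id, type(user.id))  # 42 <class 'int'>

try:
    User(id="not-a-number", name="Ada", email="ada@example.com")
except ValidationError as exc:
    print(len(exc.errors()), "validation error(s)")
```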
Rectified Linear Unit (ReLU): An activation function that returns max(0, x), i.e., 0 for negative inputs and the input itself for positive inputs.
Recurrent Neural Network (RNN): A neural network designed for sequential data; the output of the previous step is fed back as input to the current step, giving the network a form of memory over the sequence.
Regularization and Regularization Rate (lambda): One approach to keeping a model simple is to penalize complex models, that is, to force the model to become simpler during training. Penalizing complex models is one form of regularization.

i.e., a training-time optimization technique.

A regularization rate (lambda) controls the strength of regularization, with higher values leading to simpler models and lower values increasing the risk of overfitting.
i.e., the training objective becomes: minimize(loss(data|model) + lambda * complexity(model))

For more details, refer to Google Developers.

Regularization types:
- L1 regularization
- L2 regularization

A high regularization rate:
- Strengthens the influence of regularization, thereby reducing the chances of overfitting.
- Tends to produce a histogram of model weights having the following characteristics:
  - a normal distribution
  - a mean weight of 0.

A low regularization rate:
- Lowers the influence of regularization, thereby increasing the chances of overfitting.
- Tends to produce a histogram of model weights with a flat distribution.
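A toy NumPy sketch of an L2 penalty added to a data loss; the weights and lambda values are made up purely for illustration:

```python
import numpy as np

def loss_with_l2(data_loss, weights, lam):
    # L2 regularization adds lambda * sum(w^2) to the data loss,
    # penalizing large weights and pushing the model toward simplicity.
    return data_loss + lam * np.sum(weights ** 2)

weights = np.array([0.8, -1.2, 0.3])
print(loss_with_l2(data_loss=0.5, weights=weights, lam=0.0))  # no regularization
print(loss_with_l2(data_loss=0.5, weights=weights, lam=0.1))  # higher lambda, larger penalty
```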
RESTful API: An API that follows the REST (Representational State Transfer) architectural style, in which standard HTTP methods (GET, POST, PUT, DELETE) operate on resources identified by URLs, typically exchanging JSON.
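A hedged sketch using the Python requests library; the URL and resource shape are placeholders, not a real service:

```python
import requests

# REST maps HTTP verbs to operations on resources identified by URLs:
# GET reads, POST creates, PUT/PATCH updates, DELETE removes.
response = requests.get("https://api.example.com/users/42")
if response.ok:
    print(response.json())  # the resource, typically returned as JSON

created = requests.post(
    "https://api.example.com/users",
    json={"name": "Ada", "email": "ada@example.com"},
)
print(created.status_code)  # a well-behaved API returns 201 Created
```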
RoBERTa: A Robustly Optimized BERT pretraining approach from Facebook AI; it keeps the BERT architecture but trains longer, on more data, with dynamic masking and without the next-sentence-prediction objective.
Self-attention: Self-attention is the mechanism that lets each word in a sentence focus on every other word, assessing their relationships to form a context-aware representation.

In other words, the self-attention mechanism enables the model to weigh the importance of each word in a sequence relative to all other words, allowing it to capture long-range contextual relationships and dependencies.

Both encoders and decoders consist of many layers connected by the self-attention mechanism.
In the phrase “It was a bright sunny day,” the word “bright” can attend to “sunny” and “day” due to self-attention, achieving a bi-directional context that enriches its meaning. This bi-directional influence is essential for accurate tasks like named entity recognition and extractive question answering.
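A toy NumPy sketch of scaled dot-product self-attention; it omits the learned query/key/value projections and the multi-head structure of a real Transformer layer:

```python
import numpy as np

def self_attention(x):
    # x: one row per token (its embedding).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # how much each token relates to every other token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights per token
    return weights @ x             # context-aware representation of each token

tokens = np.random.randn(6, 8)      # 6 tokens, 8-dimensional embeddings
print(self_attention(tokens).shape)  # (6, 8)
```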
Self-supervised learning: A type of learning in which the objective is automatically computed from the inputs of the model. In other words, a type of machine learning where models learn to predict parts of the data from other parts, without labels. That means humans are not needed to label the data. Often used in natural language processing (NLP) and computer vision to pre-train models on large datasets.

Transformer models like GPT, BERT, BART, T5, etc. have been trained as language models on large amounts of raw data in a self-supervised fashion. This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning.
Semantic Parsing: Converting language into structured data, often for databases or code. Example: T5 converts language into structured data (e.g., SQL queries).
Sentiment Analysis: Detecting the emotional tone of text. Example: a BERT-based sentiment classifier analyzes text for sentiment polarity.
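A small sketch assuming the Hugging Face transformers library; the default sentiment-analysis pipeline downloads a BERT-family classifier (a DistilBERT checkpoint fine-tuned on SST-2):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I really enjoyed this movie!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```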
Sequence-to-sequence transformer models: Encoder-decoder transformer models that map an input sequence to an output sequence, e.g., BART- and T5-like models.
Sigmoid: An activation function that maps any real-valued input to a value between 0 and 1, defined as sigmoid(x) = 1 / (1 + e^(-x)); commonly used for binary classification outputs.
Special Purpose LLMs: Highly trained to focus on a single task or a small set of tasks, in contrast to general-purpose LLMs. Example: Raven-13B
- Tuned to provide function calling services
- Smaller & lower latency than general purpose LLMs
Summarization: Producing concise summaries of longer content. Example: BART generates concise summaries for long texts.
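A small sketch assuming the Hugging Face transformers library; facebook/bart-large-cnn is a BART checkpoint fine-tuned for summarization, and the article text here is just a stand-in:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "Transformer models rely on attention mechanisms to process sequences in parallel. "
    "They are pretrained on large amounts of raw text in a self-supervised fashion and "
    "then fine-tuned, via transfer learning, for tasks such as translation, "
    "question answering, and summarization."
)
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```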
Text Classification: Categorizing text into predefined labels.
Text Completion: Generating the continuation of a given text. Example: GPT-3 completes text based on context.
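A small sketch assuming the Hugging Face transformers library; GPT-3 itself is only available through an API, so this uses GPT-2 from the same model family:

```python
from transformers import pipeline

# A causal language model continues the prompt one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The cat sat on the", max_new_tokens=10)[0]["generated_text"])
```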
Transfer learning: A technique where a model developed for one task is reused as the starting point for a model on a second, related task. In other words, a process in which a (general) pretrained model is fine-tuned in a supervised way - that is, using human-annotated labels - on a given task. The principle of transfer learning is to leverage knowledge from one domain (source task) to improve learning in another (target task). Often used in NLP and computer vision, where pre-trained models (e.g., BERT, ResNet) are fine-tuned for new tasks.
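A minimal Keras sketch of the idea, assuming TensorFlow is installed: a ResNet50 pretrained on ImageNet (the source task) is frozen and reused as a feature extractor for a hypothetical 3-class target task.

```python
import tensorflow as tf

# Source task: ImageNet classification. The pretrained ResNet50 is frozen
# and reused as a feature extractor for the new target task.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(224, 224, 3)
)
base.trainable = False  # keep the transferred knowledge fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),  # new head trained on the target labels
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(target_train_ds, epochs=5)  # fine-tune on the (smaller) labelled target dataset
```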
Transformer Model: A deep neural network (DNN) architecture primarily used for natural language processing (NLP) tasks. Transformer models rely on attention mechanisms to process data, enabling parallelization and improved performance on sequential data. Widely used in NLP tasks like translation, text generation, and question-answering; includes models like BERT and GPT.
Uni-directional attention: Decoder models operate in a uni-directional manner, meaning they only consider the context from the left (previous words) when making predictions. They do not access future words, in contrast to the bidirectional attention used in encoder models.
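A toy NumPy sketch of the causal (uni-directional) mask a decoder applies to its attention scores:

```python
import numpy as np

seq_len = 5
# Position i may only attend to positions <= i (the words to its left);
# a lower-triangular matrix of ones expresses exactly that.
causal_mask = np.tril(np.ones((seq_len, seq_len)))
print(causal_mask)
# An encoder's bidirectional attention would use a mask of all ones instead.
```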
Vanishing gradient: The tendency of the gradients for the lower layers of a deep network to become very small ("vanish") during backpropagation, so those layers' weights barely change and training stalls. For more details, refer to the Google Developers ML course.
Prompt Engineering: Crafting a prompt so that an LLM does what we want.

By crafting better prompts, we can improve the quality of the results.
Incidence: The occurrence, rate, or frequency of a disease, crime, or other undesirable thing.

e.g., An increased incidence of cancer.

Additional Resources