Most modern LLMs rely on the Transformer architecture, a deep neural network architecture introduced in “Attention Is All You Need” (Vaswani et al., 2017).

Transformers are language models

All the Transformer models (such as GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pre-trained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.
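
As a rough sketch of this two-step recipe (self-supervised pretraining followed by supervised fine-tuning), here is what fine-tuning could look like with the Hugging Face Trainer API; the checkpoint, dataset, and hyperparameters are placeholder choices, not part of the original text.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-cased"          # pretrained (self-supervised) language model
dataset = load_dataset("imdb")          # example human-labeled dataset (assumption)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Supervised fine-tuning on the labeled task (the transfer learning step).
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,
)
trainer.train()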

Transformers are big models

General strategy: as the model size and the amount of training data increase, the model achieves better performance.

Model parameters

Transformer timeline

Categories of transformer models

  1. Auto-regressive transformer models (GPT-like)
  2. Auto-encoding transformer models (BERT-like)
  3. Sequence-to-sequence transformer models (BART/T5-like)

General architecture of the transformer model

The transformer model is primarily composed of two blocks:

  • Encoder: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder: The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Transformer blocks

Each of these parts can be used independently, depending on the task (a minimal loading sketch follows the list):

  • Encoder-only models
    • Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models
    • Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models
    • Good for generative tasks that require an input, such as translation or summarization.
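
As a quick illustration of how these three families are commonly loaded with the Hugging Face transformers library (the checkpoint names are well-known public examples, not prescribed by the text):

from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-cased")           # BERT-like: understanding
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")           # GPT-like: text generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # T5-like: sequence-to-sequence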

Attention layers

Attention is all you need!

A key feature of Transformer models is that they are built with special layers called attention layers. These layers tell the model to pay specific attention to certain words in the sentence you pass it (and to more or less ignore the others) when building the representation of each word.

To put this into context, consider the task of translating text from English to French. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.

From ChatGPT

An attention layer in transformers is a mechanism that allows the model to focus on specific parts of an input sequence, enabling it to capture dependencies between different parts of the sequence. This attention mechanism is the core idea behind transformers and is especially useful for handling sequential data in tasks like natural language processing.

Here’s a breakdown of how it works and why it’s important:

1. Purpose of Attention

  • In tasks like language translation or text summarization, words in a sentence often relate to one another in complex ways. For instance, the meaning of a word may depend on other words far away in the sentence.
  • Traditional models like RNNs and LSTMs struggle with long-range dependencies, as they process sequences sequentially. Attention allows the model to capture these relationships more flexibly by directly “attending” to any part of the sequence.

2. How Attention Works

  • In the attention layer, each word (or token) in the sequence calculates a score for its relationship to every other word in the sequence.
  • These scores determine how much “attention” each word should give to other words in the context of understanding the current word.
  • For example, in the sentence “The cat sat on the mat”, the word “cat” might give higher attention scores to related words like “sat” or “mat” rather than to “The” or “on.”
flowchart TB
    %% Links represent attention weights (higher score = more relevant word)
    cat -.->|"Low Attention (0.2)"| The
    cat -->|"High Attention (1.0)"| sat
    cat -->|"Moderate Attention (0.5)"| mat
    cat -.->|"Low Attention (0.1)"| on

3. Self-Attention (Scaled Dot-Product Attention)

Self-Attention is the specific form of attention used in transformers. Each word in the sequence attends to all other words in the sequence, including itself. In other words, it highlights relationships between each word and every other word, enhancing context comprehension.

Steps in Self-Attention (a minimal code sketch follows this list):

  1. Query, Key, and Value Vectors: Each word (or token) is transformed into three vectors: a query (Q), a key (K), and a value (V).
  2. Dot Product of Query and Key: The query of each word is compared to the keys of all other words using a dot product, creating attention scores.
  3. Scaled and Softmaxed Scores: The scores are scaled down (by the square root of the dimension) and then passed through a softmax to create a probability distribution.
  4. Weighted Sum of Values: These scores are used to compute a weighted sum of the value vectors, which represents the attention output for each word.
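
The four steps above correspond to the usual formula Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Below is a minimal NumPy sketch of a single attention head; the shapes and random weights are purely illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # step 1: project tokens to Q, K, V
    scores = Q @ K.T                            # step 2: dot product of queries and keys
    scores = scores / np.sqrt(K.shape[-1])      # step 3a: scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)          # step 3b: softmax -> attention distribution
    return weights @ V                          # step 4: weighted sum of the values

# Tiny example: 5 tokens, model dimension 8 (random weights, just for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 8): one contextualized vector per token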

Bi-directional attention

Bi-directional attention, a characteristic of encoder models, is the result of applying self-attention in both forward and backward directions, enabling each word to be influenced by all other words in the sentence.

Self-attention and bi-directional attention

Self-attention and bi-directional attention are closely linked in transformer encoder models. Self-attention is the mechanism that lets each word in a sentence focus on every other word, assessing their relationships to form a context-aware representation. In an encoder this attention is applied bi-directionally, so each word can draw on context from both earlier and later words. Together, they allow the model to build a full-sentence understanding for richer contextual representations.

4. Multi-Head Attention

  • Transformers use multiple attention heads, each learning different aspects of word relationships. These heads run in parallel and their outputs are then combined, giving the model a richer and more nuanced understanding (see the sketch below).
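
As a rough illustration, PyTorch ships a ready-made multi-head attention module; the dimensions below are arbitrary and the module is only a stand-in for the attention blocks inside a real transformer.

import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 64, 8, 10                 # illustrative sizes
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)                    # (batch, seq_len, embed_dim)
# Self-attention: the same tensor serves as query, key, and value.
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([1, 10, 64])  — one output vector per token
print(attn_weights.shape)  # torch.Size([1, 10, 10])  — attention weights, averaged over heads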

5. Why Attention Layers Are Important

  • Parallelization: Unlike RNNs, transformers can process the entire sequence at once, making training faster and more efficient.
  • Long-Range Dependencies: Attention layers allow the model to capture relationships between distant words, which is critical in understanding context.
  • Versatility: The attention mechanism adapts well to various tasks, from translation and summarization to image processing.

In Summary

An attention layer in transformers, especially through self-attention and multi-head attention, enables the model to “focus” on relevant parts of the sequence, allowing it to understand relationships between words regardless of their distance from each other. This is a major breakthrough in making language models highly effective and efficient for complex tasks.

Mermaid diagrams

Attention Mechanism Overview

flowchart TD

A[Input Sequence] --> B[Embedding Layer]
B --> C[Self-Attention Layer]
C --> D[Multi-Head Attention]
D --> E[Feedforward Layer]
E --> F[Output Sequence]

Self-Attention Mechanism (Scaled Dot-Product Attention)

sequenceDiagram
    participant Input as Input Embeddings
    participant Query as Query (Q)
    participant Key as Key (K)
    participant Value as Value (V)
    participant DotProduct as Dot Product
    participant Scale as Scale
    participant Softmax as Softmax
    participant Output as Attention Output

    Input ->> Query: Generate Query (Q)
    Input ->> Key: Generate Key (K)
    Input ->> Value: Generate Value (V)
    Query ->> DotProduct: Dot Product of Q and K
    DotProduct ->> Scale: Scale by √d
    Scale ->> Softmax: Softmax
    Softmax ->> Output: Weighted Sum of V

Multi-Head Attention

flowchart TD
    A[Input Sequence] --> B[Self-Attention Head 1]
    A --> C[Self-Attention Head 2]
    A --> D[Self-Attention Head N]

    B --> E[Concatenate Heads]
    C --> E
    D --> E
    E --> F[Final Attention Output]

The original transformer architecture

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
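
A minimal sketch of this masking idea in PyTorch (illustrative only, not the exact code of any particular implementation): entries above the diagonal mark the “future” positions the decoder is not allowed to attend to.

import torch

seq_len = 5
# True above the diagonal = future positions that must be hidden from the decoder.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
# Row i shows what position i may attend to: itself and earlier positions only.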

The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:

Original transformer architecture

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
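
For the padding case, here is a hedged sketch with a Hugging Face tokenizer (the sentences are arbitrary examples): padding tokens receive a 0 in the returned attention_mask, and the model uses that mask to ignore them.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(
    ["Transformers are big models.", "Attention is all you need!"],
    padding=True,              # pad the shorter sentence up to the longest one in the batch
    return_tensors="pt",
)
print(batch["attention_mask"])  # 1 = real token, 0 = padding the model should not attend to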

Architectures vs. Checkpoints

These terms all have slightly different meanings:

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
  • Checkpoints: These are the weights that will be loaded in a given architecture.
  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both.

For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”
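
To make the distinction concrete, a small illustrative sketch with the Hugging Face transformers library: instantiating the architecture from a config gives randomly initialized weights, while from_pretrained loads the bert-base-cased checkpoint.

from transformers import BertConfig, BertModel

# Architecture only: the BERT skeleton with randomly initialized weights.
config = BertConfig()
random_bert = BertModel(config)

# Architecture + checkpoint: the weights trained by Google for the first BERT release.
pretrained_bert = BertModel.from_pretrained("bert-base-cased")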