Note to self: Need to revisit this topic after gaining a better understanding of "context vector"

RNN without vs. with Attention

RNN without Attention (Basic Seq2Seq Model)

A simple RNN-based sequence-to-sequence (Seq2Seq) model consists of:

  • Encoder: Reads the input sequence (German sentence) and compresses it into a single fixed-length vector (context vector).
  • Decoder: Takes this fixed vector and generates the output sequence (English translation).
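
A minimal PyTorch sketch of these two parts is shown below (the GRU cells, embedding/hidden sizes, and vocabulary sizes are illustrative choices for the example, not a prescribed architecture):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns its hidden states."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        embedded = self.embed(src_ids)           # (batch, src_len, emb_dim)
        outputs, last_hidden = self.rnn(embedded)
        # outputs: all hidden states h1..hN      (batch, src_len, hidden_dim)
        # last_hidden: the final state hN        (1, batch, hidden_dim)
        return outputs, last_hidden

class Decoder(nn.Module):
    """Generates the target sentence one token at a time."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):       # prev_token: (batch, 1)
        embedded = self.embed(prev_token)        # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output.squeeze(1))     # (batch, vocab_size)
        return logits, hidden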

Example Translation:

  • German: 👉 “Das ist ein schönes Haus.”
  • English (Expected Output): 👉 “This is a beautiful house.”

Step-by-Step Process (Without Attention)

  1. Encoding:
    • The encoder processes each word of the input sentence using an RNN (e.g., LSTM or GRU).
    • After processing all words, it outputs a context vector (a single fixed-length hidden state).
    Example:
    Input: ["Das", "ist", "ein", "schönes", "Haus"]
    Encoder RNN hidden states: h1 → h2 → h3 → h4 → h5
    Final context vector = h5
    
  2. Decoding:
    • The decoder starts with the context vector and generates the output sequence one word at a time.
    • It does so without looking back at the individual encoder states - it only has access to the single final context vector, h5 (see the sketch after these steps).
    Example:
    Decoder RNN starts with context vector h5
    Generates: ["This"] → ["is"] → ["a"] → ["beautiful"] → ["house"]
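
Using the modules sketched above, this "encode once, then decode from h5 alone" process looks roughly like the following (the token ids and the start-of-sentence id sos_id are made up for the example, and greedy decoding is just one possible choice):

# Basic Seq2Seq (no attention): the decoder sees only the final context vector.
src_ids = torch.tensor([[11, 12, 13, 14, 15]])   # pretend ids for "Das ist ein schönes Haus"
encoder = Encoder(vocab_size=1000)
decoder = Decoder(vocab_size=1000)

# 1. Encoding: process the whole source sentence, keep ONLY the final hidden state.
_, context = encoder(src_ids)                    # context = h5, shape (1, 1, hidden_dim)

# 2. Decoding: start from the context vector and emit one word per step.
sos_id = 1                                       # hypothetical start-of-sentence id
prev = torch.tensor([[sos_id]])
hidden = context                                 # the ONLY information about the source
translation = []
for _ in range(5):                               # "This is a beautiful house"
    logits, hidden = decoder(prev, hidden)
    prev = logits.argmax(dim=-1, keepdim=True)   # greedy pick of the next word
    translation.append(prev.item())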
    

Limitations of RNN Without Attention

  • The fixed-length context vector forces the model to compress all information into a single vector, which makes translation difficult, especially for long sentences.
  • If the sentence is long, earlier words might be forgotten by the time the decoder starts generating words.


RNN with Attention (Seq2Seq + Attention)

To overcome these limitations, Attention allows the decoder to dynamically focus on different parts of the input sentence at each decoding step.

Key Difference:

Instead of relying only on a single fixed context vector, the decoder attends to different encoder states using a set of learned weights.

Step-by-Step Process (With Attention)

  1. Encoding (Same as Before):
    • The encoder processes each input word and generates a sequence of hidden states; this time all of them are kept, rather than collapsing everything into a single context vector.

    Example

	Input: ["Das", "ist", "ein", "schönes", "Haus"]
	Encoder RNN hidden states: h1, h2, h3, h4, h5
  2. Decoding with Attention:
    • Instead of using just one context vector, at each decoding step the decoder assigns weights to the different encoder hidden states based on their relevance.

    Example (for generating “beautiful”):
    • The decoder generates an attention score for each encoder state.
    • The weighted sum of these encoder states becomes the context vector for this step (see the code sketch after this list).

	Attention scores for "beautiful":
	(Das: 0.1, ist: 0.1, ein: 0.2, schönes: 0.5, Haus: 0.1)
	Weighted context vector = (0.1*h1 + 0.1*h2 + 0.2*h3 + 0.5*h4 + 0.1*h5)
  3. Final Output with Attention:
	Decoder generates:
	"This" → "is" → "a" → (focuses on "schönes") "beautiful" → (focuses on "Haus") "house"

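In code, one decoding step with attention boils down to three small operations: score each encoder state against the current decoder state, softmax the scores into weights, and take the weighted sum. The sketch below uses a simple dot-product score for brevity (the original Bahdanau attention uses a small feed-forward network instead) and random placeholder tensors:

import torch
import torch.nn.functional as F

hidden_dim = 128
encoder_states = torch.randn(5, hidden_dim)      # h1..h5 for "Das ist ein schönes Haus"
decoder_state = torch.randn(hidden_dim)          # decoder state while producing "beautiful"

# 1. One relevance score per encoder state (dot-product scoring, chosen for brevity).
scores = encoder_states @ decoder_state          # shape (5,)

# 2. Softmax turns the scores into weights that sum to 1,
#    e.g. something like (0.1, 0.1, 0.2, 0.5, 0.1) in the example above.
weights = F.softmax(scores, dim=0)               # shape (5,)

# 3. The context vector for this step is the weighted sum of the encoder states:
#    context = 0.1*h1 + 0.1*h2 + 0.2*h3 + 0.5*h4 + 0.1*h5 (with the example weights).
context = weights @ encoder_states               # shape (hidden_dim,)
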
Comparison: RNN Without vs. With Attention

| Feature | RNN Without Attention | RNN With Attention |
| --- | --- | --- |
| Context Vector | Single fixed-length vector | Dynamically weighted sum of encoder states |
| Performance on long sentences | Struggles, loses early information | Better, selectively focuses on important words |
| Accuracy | Lower for long sentences | Higher, better word alignment |
| Interpretability | Hard to understand decisions | Can visualize attention weights to see which words were focused on |
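
Because the weights are explicit numbers, they can be plotted directly. A minimal matplotlib sketch, assuming the per-step weights have been collected into a (target length x source length) matrix; only the "beautiful" row uses the weights from the example above, the other rows are made-up values:

import matplotlib.pyplot as plt
import numpy as np

src = ["Das", "ist", "ein", "schönes", "Haus"]
tgt = ["This", "is", "a", "beautiful", "house"]

# One row per generated word, one column per source word.
attn = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],  # "This"      (made-up)
    [0.10, 0.70, 0.10, 0.05, 0.05],  # "is"        (made-up)
    [0.10, 0.10, 0.60, 0.10, 0.10],  # "a"         (made-up)
    [0.10, 0.10, 0.20, 0.50, 0.10],  # "beautiful" (weights from the example above)
    [0.05, 0.05, 0.10, 0.10, 0.70],  # "house"     (made-up)
])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("Source (German)")
plt.ylabel("Generated (English)")
plt.colorbar(label="attention weight")
plt.title("Which source words each output word focused on")
plt.show()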

Comparison: Mermaid diagrams

Seq2Seq Without Attention

graph TD;
    A[Input: Das ist ein schoenes Haus] -->|Tokenized| B[Encoder RNN];
    B -->|Hidden states| C[Final Context Vector];
    C -->|Passed to| D[Decoder RNN];
    D -->|Generates words one by one| E[Output: This is a beautiful house];

Seq2Seq With Attention

graph TD;
    A[Input: Das ist ein schoenes Haus] -->|Tokenized| B[Encoder RNN];
    B -->|Generates hidden states| C1[h1] & C2[h2] & C3[h3] & C4[h4] & C5[h5];

    subgraph Attention Mechanism
        C1 & C2 & C3 & C4 & C5 -->|Weighted sum| AT1[Weighted context 1];
        C1 & C2 & C3 & C4 & C5 -->|Weighted sum| AT2[Weighted context 2];
        C1 & C2 & C3 & C4 & C5 -->|Weighted sum| AT3[Weighted context 3];
        C1 & C2 & C3 & C4 & C5 -->|Weighted sum| AT4[Weighted context 4];
        C1 & C2 & C3 & C4 & C5 -->|Weighted sum| AT5[Weighted context 5];
    end

    AT1 -->|Focused context for word 1| D1[Decoder Step 1: This];
    AT2 -->|Focused context for word 2| D2[Decoder Step 2: is];
    AT3 -->|Focused context for word 3| D3[Decoder Step 3: a];
    AT4 -->|Focused context for word 4| D4[Decoder Step 4: beautiful];
    AT5 -->|Focused context for word 5| D5[Decoder Step 5: house];

    D1 --> D2 --> D3 --> D4 --> D5;
    D5 --> E[Final Output: This is a beautiful house];

Conclusion

Without attention, the model must encode the entire sentence into a single vector, which can cause loss of information.
With attention, the model can selectively focus on the relevant parts of the input sentence at each decoding step, leading to better translations, especially for long sentences.