Transformer Decoder Flow Chart

We will walk through the architecture of a GPT-2 model, which is a decoder-only Transformer. This means it has no cross-attention mechanism that consults an encoder; its entire job is to predict the next token based on the text it has seen so far.

The Scenario

Model: GPT-2 (small version).
Analogy: An Autoregressive Poet, trying to complete a line of poetry.
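Demo Input: The poet has written "The quick brown fox jumps over the lazy". The model receives this text as a sequence of token IDs, taken here to be 9 tokens long (the phrase has eight words, and token counts need not match word counts).
Goal: To calculate the most likely next token ("dog").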

GPT-2 (Small) Dimensions

vocab_size: 50,257 (The number of tokens in its vocabulary; GPT-2 uses byte-pair-encoded subword tokens rather than whole words)
d_model: 768 (The "width" or embedding size of each vector)
n_layer: 12 (The number of identical Decoder blocks stacked on top of each other)
n_head: 12 (The number of attention "specialists" in each block)
d_head: 64 (The dimension of each specialist's vectors, since 768 / 12 = 64)
seq_len: 9 (The length of our input sentence, in tokens)
batch_size: 1 (We are processing one sentence at a time)
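
To make the relationships between these numbers explicit, here is a minimal sketch of a configuration object. The class name `GPT2SmallConfig`, the field `n_ctx`, and the helper `weight_matrix_params` are hypothetical names chosen for illustration; the helper counts only the weight matrices named in this walkthrough, so it ignores biases and LayerNorm parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPT2SmallConfig:
    vocab_size: int = 50257   # entries in the token vocabulary
    d_model: int = 768        # width of every token vector
    n_layer: int = 12         # stacked decoder blocks
    n_head: int = 12          # attention heads per block
    n_ctx: int = 1024         # maximum sequence length

    @property
    def d_head(self) -> int:
        # Each head works on an equal slice of the model width.
        assert self.d_model % self.n_head == 0
        return self.d_model // self.n_head

    @property
    def d_ff(self) -> int:
        # Hidden width of the feed-forward network: 4 * d_model.
        return 4 * self.d_model

    def weight_matrix_params(self) -> int:
        # Counts only the weight matrices (no biases, no LayerNorm parameters).
        embeddings = self.vocab_size * self.d_model + self.n_ctx * self.d_model
        attention = 4 * self.d_model * self.d_model   # W_Q, W_K, W_V, W_O
        feed_forward = 2 * self.d_model * self.d_ff   # two linear layers
        return embeddings + self.n_layer * (attention + feed_forward)


cfg = GPT2SmallConfig()
print(cfg.d_head, cfg.d_ff, cfg.weight_matrix_params())
```

The count comes out at roughly 124 million, most of it in the 12 blocks and the token embedding matrix.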

Flow Chart of a Single GPT-2 Decoder Block

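This is the assembly line for processing the input text. The entire block is repeated 12 times, each copy with its own weights. The chart uses the original Transformer's ordering, with each normalization applied after its station (post-norm). The released GPT-2 instead normalizes the input to each station (pre-norm) and adds one final Layer Normalization after Block 12; the matrix shapes are the same either way.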

      Input for this Block (Shape: [1, 9, 768])
                  |
                  |
┌─────────────────▼─────────────────┐
│   STATION 1: Self-Review          │
│   Masked Multi-Head Attention     │
│   (Poet re-reads their own line   │
│    with blinders on to find       │
│    internal connections)          │
└─────────────────┬─────────────────┘
                  |
                  |
┌─────────────────▼─────────────────┐
│   CHECKPOINT A: Stabilize         │
│   Add & Layer Normalization       │
│   (Staple the original text to    │
│    the new insights & standardize)│
└─────────────────┬─────────────────┘
                  |
                  |
┌─────────────────▼─────────────────┐
│   STATION 2: Deep Thinking        │
│   Feed-Forward Network            │
│   (Poet thinks deeply about each  │
│    word's new context, alone)     │
└─────────────────┬─────────────────┘
                  |
                  |
┌─────────────────▼─────────────────┐
│   CHECKPOINT B: Stabilize         │
│   Add & Layer Normalization       │
│   (Staple the previous version to │
│    the new thoughts & standardize)│
└─────────────────┬─────────────────┘
                  |
                  |
                  ▼
      Output of this Block (Shape: [1, 9, 768])
      (Ready to be the input for the next block)

Matrix Dimensions and Flow for the Entire Model

Here is the detailed, step-by-step journey of the data, showing the shape of the matrix at each stage.

Step 0: The Input

The model receives the token IDs for the sentence.

* Input_Tokens: [1, 9]

Step 1: Embedding (The Foundation)

We convert the token IDs into rich vectors.

* Input: Input_Tokens [1, 9]
* Operation: Token Embedding + Positional Embedding
* Matrices:
    * W_E (Token Embedding Matrix): [50257, 768]
    * W_P (Positional Embedding Matrix): [1024, 768] (learned rather than sinusoidal; GPT-2's max sequence length is 1024)
* Process:
    1. Lookup in W_E -> Token_Embeddings [1, 9, 768]
    2. Add first 9 rows of W_P -> Input_Embeddings [1, 9, 768]
* Output (X_0): [1, 9, 768]
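
The lookup-and-add in this step can be sketched in NumPy as follows. The matrices are random stand-ins for the learned weights and the token IDs are made up, so only the shapes and the indexing pattern are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_ctx, d_model = 50257, 1024, 768

# Random stand-ins for the learned matrices (float32 to keep memory modest).
W_E = rng.standard_normal((vocab_size, d_model), dtype=np.float32)
W_P = rng.standard_normal((n_ctx, d_model), dtype=np.float32)

# Hypothetical token IDs: one sentence of 9 tokens, shape [1, 9].
input_tokens = rng.integers(0, vocab_size, size=(1, 9))
seq_len = input_tokens.shape[1]

token_embeddings = W_E[input_tokens]          # row lookup   -> [1, 9, 768]
position_embeddings = W_P[:seq_len]           # first 9 rows -> [9, 768]
X_0 = token_embeddings + position_embeddings  # broadcast    -> [1, 9, 768]

print(X_0.shape)
```

Indexing `W_E` with an integer array is what makes the "lookup" cheap: no one-hot matrix multiplication is needed, and the position rows broadcast across the batch dimension.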

Step 2 to N: The 12 Decoder Blocks

The X_0 matrix now goes through the 12-block assembly line. Let's trace the journey through Block 1.

2a. Masked Multi-Head Attention (Station 1)

* Input: X_0 [1, 9, 768]
* Matrices:
    * W_Q1, W_K1, W_V1 (Query, Key, Value for Block 1): Each effectively [768, 768]
    * W_O1 (Output Projection for Block 1): [768, 768]
* Process:
    1. Q, K, V = X_0 @ W_Q1, X_0 @ W_K1, X_0 @ W_V1
    2. Split into 12 heads -> Q, K, V are now [1, 12, 9, 64]
    3. Scores = Q @ K^T (transposing the last two axes of K) -> [1, 12, 9, 9]
    4. Apply Look-Ahead Mask (sets the entries above the diagonal to -infinity)
    5. Scale: Scores / sqrt(64)
    6. Attention_Weights = softmax(Scores) -> [1, 12, 9, 9]
    7. Weighted_Values = Attention_Weights @ V -> [1, 12, 9, 64]
    8. Concatenate heads -> Concatenated [1, 9, 768]
    9. Attention_Output_1 = Concatenated @ W_O1 -> [1, 9, 768]
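
The nine steps above can be sketched in NumPy like this. The weights are random stand-ins, biases are omitted, and the helper `split_heads` is a hypothetical name; the point is to watch the shapes and the mask, not to reproduce GPT-2's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, n_head = 1, 9, 768, 12
d_head = d_model // n_head  # 64

X_0 = rng.standard_normal((batch, seq_len, d_model))
# Random stand-ins for the learned projections (biases omitted).
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))


def split_heads(t):
    # [1, 9, 768] -> [1, 9, 12, 64] -> [1, 12, 9, 64]
    return t.reshape(batch, seq_len, n_head, d_head).transpose(0, 2, 1, 3)


Q, K, V = (split_heads(X_0 @ W) for W in (W_Q, W_K, W_V))

scores = Q @ K.transpose(0, 1, 3, 2)                      # [1, 12, 9, 9]
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)                # hide later positions
scores = scores / np.sqrt(d_head)

# Numerically stable softmax over the key axis.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)   # [1, 12, 9, 9]

weighted_values = weights @ V                             # [1, 12, 9, 64]
concatenated = weighted_values.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
attention_output = concatenated @ W_O                     # [1, 9, 768]

print(attention_output.shape)
```

Because masked entries become exactly zero after the softmax, the first token can only attend to itself, and every later token sees only itself and the tokens before it. Masking before or after the scaling gives the same result, since -infinity divided by a positive number is still -infinity.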

2b. Add & Norm 1 (Checkpoint A)

* Input: X_0 [1, 9, 768] and Attention_Output_1 [1, 9, 768]
* Process: X_1 = LayerNorm(X_0 + Attention_Output_1)
* Output (X_1): [1, 9, 768]
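
What "staple and standardize" means numerically can be sketched as below. The inputs are random stand-ins, and the parameter names `gamma` and `beta` and the value `eps=1e-5` are assumptions made for illustration.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector (last axis) to zero mean and unit variance,
    # then apply the learned per-feature scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
X_0 = rng.standard_normal((1, 9, 768))
attention_output_1 = rng.standard_normal((1, 9, 768))  # stand-in for Station 1's output

gamma, beta = np.ones(768), np.zeros(768)  # typical initial values
X_1 = layer_norm(X_0 + attention_output_1, gamma, beta)  # residual add, then normalize

print(X_1.shape)
```

The statistics are computed per token over its 768 features, so the 9 positions never influence each other in this step.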

2c. Feed-Forward Network (Station 2)

* Input: X_1 [1, 9, 768]
* Matrices:
    * W_FF1_1 (First Linear Layer): [768, 3072]
    * W_FF1_2 (Second Linear Layer): [3072, 768]
* Process:
    1. Hidden = GELU(X_1 @ W_FF1_1) -> [1, 9, 3072]
    2. FF_Output_1 = Hidden @ W_FF1_2 -> [1, 9, 768]
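
The expand-then-project pattern can be sketched as follows, using the tanh approximation of GELU. Weights are random stand-ins and biases are omitted; the last lines check the "alone" part of the analogy by running a single token through the network on its own.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
d_model, d_ff = 768, 3072
X_1 = rng.standard_normal((1, 9, d_model))
W_FF1_1 = rng.standard_normal((d_model, d_ff)) * 0.02
W_FF1_2 = rng.standard_normal((d_ff, d_model)) * 0.02

hidden = gelu(X_1 @ W_FF1_1)        # [1, 9, 3072]
ff_output_1 = hidden @ W_FF1_2      # [1, 9, 768]

# Position-wise check: running the token at index 4 on its own gives the same row.
alone = gelu(X_1[:, 4:5] @ W_FF1_1) @ W_FF1_2
print(np.allclose(alone, ff_output_1[:, 4:5]))
```

Unlike attention, nothing here mixes information across positions; the same two matrices are applied to each of the 9 token vectors independently.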

2d. Add & Norm 2 (Checkpoint B)

* Input: X_1 [1, 9, 768] and FF_Output_1 [1, 9, 768]
* Process: X_2 = LayerNorm(X_1 + FF_Output_1)
* Output (X_2): [1, 9, 768]

This X_2 is the final output of Block 1. It becomes the input to Block 2, and the process repeats 11 more times until we get the final output from Block 12, X_final.
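
To see the repetition in code, here is a compact sketch that chains 12 blocks, following the Add & Norm ordering used in this walkthrough. It makes several simplifying assumptions: the widths are scaled down (d_model = 64, 4 heads) so it runs quickly, weights are random, biases and the learned LayerNorm scale and shift are omitted, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layer, n_head, d_model, seq_len = 12, 4, 64, 9   # scaled-down widths
d_head = d_model // n_head


def layer_norm(x, eps=1e-5):
    # Learned scale and shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def masked_attention(x, p):
    b, t, _ = x.shape

    def heads(w):
        return (x @ w).reshape(b, t, n_head, d_head).transpose(0, 2, 1, 3)

    q, k, v = heads(p["W_Q"]), heads(p["W_K"]), heads(p["W_V"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    scores = np.where(np.triu(np.ones((t, t), dtype=bool), k=1), -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return (w @ v).transpose(0, 2, 1, 3).reshape(b, t, d_model) @ p["W_O"]


def feed_forward(x, p):
    h = x @ p["W_FF_1"]
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    return h @ p["W_FF_2"]


def decoder_block(x, p):
    x = layer_norm(x + masked_attention(x, p))   # Station 1 + Checkpoint A
    return layer_norm(x + feed_forward(x, p))    # Station 2 + Checkpoint B


def init_block():
    shapes = {"W_Q": (d_model, d_model), "W_K": (d_model, d_model),
              "W_V": (d_model, d_model), "W_O": (d_model, d_model),
              "W_FF_1": (d_model, 4 * d_model), "W_FF_2": (4 * d_model, d_model)}
    return {name: rng.standard_normal(shape) * 0.02 for name, shape in shapes.items()}


blocks = [init_block() for _ in range(n_layer)]   # each block has its own weights


def run_stack(x):
    for p in blocks:
        x = decoder_block(x, p)
    return x


X_0 = rng.standard_normal((1, seq_len, d_model))
X_final = run_stack(X_0)
print(X_final.shape)
```

Every block maps [1, 9, d_model] to [1, 9, d_model], which is what allows the blocks to be chained in a plain loop. The mask also survives the stacking: changing the last token's input leaves the outputs at all earlier positions unchanged.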

Step N+1: The Final Prediction (Pointing at the Dictionary)

The model has done all its thinking. Now it must make a choice.

* Input: X_final [1, 9, 768]
* Process:
    1. We only care about predicting the next token, so we only need the information from the last token in our sequence (the vector for "lazy").
        * Last_Token_Vector = X_final[:, -1, :] -> [1, 768]
    2. This final vector is passed through one last linear layer to convert it into a score for every token in the vocabulary.
        * Matrix: W_E^T (the transposed Token Embedding matrix is reused here). Shape: [768, 50257]
        * Logits = Last_Token_Vector @ W_E^T -> [1, 50257]
    3. The softmax function converts these raw scores into probabilities.
        * Probabilities = softmax(Logits) -> [1, 50257]
* Output: A vector of 50,257 probabilities that sum to 1. The index with the highest probability is the model's prediction when we always pick the top choice (greedy decoding). In this case, that index would hopefully be the token ID for the word "dog".
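
The final step can be sketched as below. Both `X_final` and `W_E` are random stand-ins, so the predicted ID is meaningless here; with trained weights the same three lines would produce the model's actual next-token distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50257, 768

# Random stand-ins for Block 12's output and the token embedding matrix.
X_final = rng.standard_normal((1, 9, d_model), dtype=np.float32)
W_E = rng.standard_normal((vocab_size, d_model), dtype=np.float32)
W_E *= 0.02

last_token_vector = X_final[:, -1, :]          # [1, 768]
logits = last_token_vector @ W_E.T             # [1, 50257]

# Numerically stable softmax over the vocabulary axis.
shifted = logits - logits.max(axis=-1, keepdims=True)
probabilities = np.exp(shifted)
probabilities /= probabilities.sum(axis=-1, keepdims=True)

predicted_id = int(probabilities.argmax(axis=-1)[0])   # greedy choice
print(logits.shape, predicted_id)
```

Reusing `W_E` for the output layer means each logit is a dot product between the final vector and one token's embedding, so tokens whose embeddings point in a similar direction receive high scores.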