Recurrent Neural Networks (RNNs) 1

The Simple Explanation

Imagine a standard (feedforward) neural network as a machine with no memory. If you show it a picture of a cat and then a picture of a dog, it retains nothing about the cat by the time it sees the dog. It treats every input as a brand new, isolated event.

A Recurrent Neural Network (RNN) is different because it has memory. It's designed with a 'loop' that allows information to persist from one step to the next.

Here’s the core idea:

When an RNN processes an input (like the first word in a sentence), it produces an output.
Crucially, it also saves a piece of information about that input—a 'hidden state' or memory—and passes it along to the next step.
When it processes the second word, it looks at both the new word and the memory from the first word. This continues for the entire sequence.

This ability to remember past information makes RNNs well suited to tasks where context and order are important, such as:

Predicting text: Your phone's keyboard suggests the next word based on the words you've already typed.
Language translation: The meaning of a word often depends on the rest of the sentence.
Speech recognition: Understanding a spoken word requires knowing the sounds that came before it.

An Analogy

Think of reading a book. You understand the sentence on page 50 because you remember the characters and plot developments from the first 49 pages. Your brain doesn't start from scratch with every new word; it maintains the context. An RNN is like this reader—it processes information sequentially, and its memory of what came before helps it understand what's happening now. A standard network would be like reading each sentence with amnesia, making it impossible to follow the story.

Key Takeaways

RNNs are neural networks with a form of memory, allowing them to remember past information.
They are specifically designed to work with sequential data, like text, speech, or time-series data.
The network's output at any given time depends on both the current input and the information it has learned from previous inputs.
The 'recurrent' part refers to the loop where the output from a step is fed back as an input to the next step.

What you've understood perfectly:

* The concept of memory: You've nailed the most important idea—that RNNs are unique because they have a way to "remember" past information. This is the key differentiator from other network types.

* The use case for sequential data: You correctly identified that their strength lies in handling sequences, and your examples of text, speech, and time-series data are spot on.

* The dependency on past and present: Your point that the output depends on both the current input and what came before is a perfect description of the network's behavior.

You are very close with your last point about the loop. Let's refine that just a tiny bit to make your understanding even more precise.

You wrote: "The 'recurrent' part refers to the loop where the output from a step is fed back as an input to the next step."

This is almost right, but there's a subtle and important distinction. It's not the final output (the prediction) that gets fed back, but rather the network's hidden state.

Think of it like this:

Imagine you're reading a book.

The input is the new word you are currently reading.
The hidden state is your understanding of the story so far—the context, the characters, the plot. It's your internal memory.
The output might be your prediction of what the next word will be.

When you read a new word (the input), you combine it with your existing understanding of the story (the hidden state from the previous step). This creates an updated understanding (the new hidden state).

So, the loop in an RNN works like this:

1. The network processes the current input (e.g., a word) and the hidden state from the previous step.
2. It produces two things:
  - An output for the current step (e.g., a prediction).
  - A new hidden state that captures the updated context.
3. This new hidden state is the "memory" that gets passed to the very next step in the sequence.

So, the recurrent loop is about passing the memory (hidden state) forward, not the final prediction (output). This allows the network to maintain a running summary of everything it has seen so far.
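The loop described above can be sketched in a few lines of plain Python. This is a toy scalar version, not a real implementation: the function names `rnn_step` and `run_sequence` and the three weight values are assumptions chosen only for illustration.

```python
import math

def rnn_step(x, h_prev):
    """One step of a toy scalar RNN: returns (output, new hidden state)."""
    # Hypothetical fixed weights, chosen only for illustration.
    w_x, w_h, w_y = 1.0, 0.5, 2.0
    h_new = math.tanh(w_x * x + w_h * h_prev)  # updated memory
    y = w_y * h_new                            # prediction for this step
    return y, h_new

def run_sequence(inputs):
    h = 0.0                      # empty memory before the first input
    outputs = []
    for x in inputs:
        y, h = rnn_step(x, h)    # only h is carried to the next step
        outputs.append(y)        # y is emitted, not fed back
    return outputs, h

outputs, final_h = run_sequence([1.0, 0.0, 0.0])
```

Even though the last two inputs are zero, `final_h` is not zero: the first input still echoes through the hidden state, which is the "memory" at work.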

1. The Core Math: The Recurrence Relation

Your "loop" is defined by a mathematical formula. At any time step \(t\), the "memory" (hidden state) is calculated using the previous memory and the current input.

A simple RNN cell performs this calculation:

\[h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)\]

Let's break that down:

\(h_t\): This is the new hidden state (the "memory") at the current time step \(t\).
\(h_{t-1}\): This is the old hidden state from the previous time step \(t-1\).
\(x_t\): This is the input vector at the current time step \(t\) (e.g., the vector for the word "cat").
\(W_{hh}\): This is the hidden-to-hidden weight matrix. It's the "weight of memory" — it determines how much influence the previous memory should have on the new one.
\(W_{xh}\): This is the input-to-hidden weight matrix. It's the "weight of new information" — it determines how much influence the current input should have.
\(b_h\): This is the hidden bias vector, a constant that helps shift the values.
\(\tanh\): This is the activation function (hyperbolic tangent). It squashes the resulting values to be between -1 and 1, which helps keep the numbers stable.
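As a worked example, take a scalar toy case in which every quantity is a single number. The values are assumptions picked for easy arithmetic: \(W_{hh} = 0.5\), \(W_{xh} = 1\), \(b_h = 0\), a starting memory \(h_0 = 0\), and inputs \(x_1 = 1\), \(x_2 = 0\).

\[h_1 = \tanh(0.5 \cdot 0 + 1 \cdot 1 + 0) = \tanh(1) \approx 0.76\]

\[h_2 = \tanh(0.5 \cdot 0.76 + 1 \cdot 0 + 0) = \tanh(0.38) \approx 0.36\]

The second input is zero, yet \(h_2\) is not: it still carries a faded trace of \(x_1\) through the \(W_{hh} \cdot h_{t-1}\) term.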

The most important concept: The weight matrices (\(W_{hh}\) and \(W_{xh}\)) are the same for every single time step. The network doesn't learn one set of weights for the first word and a different set for the second. It learns one shared set of rules that are applied over and over. This is called parameter sharing, and it's what allows an RNN to handle sequences of any length.
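Parameter sharing is easy to see in code. The sketch below uses plain Python lists instead of a numerical library; the helper names `matvec` and `rnn_forward` and all parameter values are hypothetical. The same three parameters are applied to a 2-step and a 7-step sequence.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_hh, W_xh, b_h):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a whole sequence."""
    h = [0.0] * len(b_h)                 # h_0: all-zero initial memory
    for x in xs:                         # the same parameters at every step
        a = matvec(W_hh, h)
        c = matvec(W_xh, x)
        h = [math.tanh(ai + ci + bi) for ai, ci, bi in zip(a, c, b_h)]
    return h

# Hypothetical parameters: hidden size 2, input size 3.
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_xh = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b_h = [0.0, 0.1]

short_h = rnn_forward([[1, 0, 0], [0, 1, 0]], W_hh, W_xh, b_h)   # 2 steps
long_h = rnn_forward([[1, 0, 0]] * 7, W_hh, W_xh, b_h)           # 7 steps

# The parameter count depends on the layer sizes, not on sequence length.
n_params = sum(len(r) for r in W_hh) + sum(len(r) for r in W_xh) + len(b_h)
```

Both calls reuse the same 12 numbers; nothing about the parameters has to change when the sequence gets longer.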

2. How an RNN Actually Makes a Prediction

The hidden state \(h_t\) is the network's internal "memory" or "thought." It's not usually the final prediction. To get a prediction, you feed this hidden state through one more layer (often a "fully connected" layer):

\[y_t = W_{hy} \cdot h_t + b_y\]

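\(y_t\): This is the final output at time step \(t\). For next-word prediction it is a vector of raw scores (logits), one per word in the vocabulary, which a softmax then turns into a probability distribution for the next word.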
\(W_{hy}\): This is the hidden-to-output weight matrix. It learns how to translate the network's internal "thought" \(h_t\) into a useful prediction.
\(b_y\): The output bias vector.
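The output layer can be sketched as follows. The formula above yields one raw score per candidate; turning those scores into probabilities with a softmax is a common choice and is an assumption added here, as are the helper names and the 3-word vocabulary.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(h, W_hy, b_y):
    """Compute y_t = W_hy h_t + b_y, then apply softmax."""
    scores = [s + b for s, b in zip(matvec(W_hy, h), b_y)]
    return softmax(scores)

# Hypothetical hidden state (size 2) and a 3-word vocabulary.
h_t = [0.76, -0.20]
W_hy = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_y = [0.0, 0.0, 0.0]
probs = predict(h_t, W_hy, b_y)
```

With these made-up values the first word gets the highest probability, because its row of `W_hy` lines up best with the hidden state.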

3. How an RNN Learns: Backpropagation Through Time (BPTT)

Training an RNN is where the main technical challenge arises.

"Unrolling": You can visualize the RNN's "loop" by "unrolling" it in time. If you have a 5-word sentence, you can picture the RNN as a 5-layer network. The first layer processes word 1, its output (the hidden state) is fed to the second layer, which processes word 2, and so on.
The Problem: The network makes a prediction at step 5 (\(y_5\)). To calculate the error, you compare \(y_5\) to the true answer. This error gradient must then be "backpropagated" not just down to \(h_5\), but also back in time to step 4, step 3, step 2, and step 1.
Why? The network needs to learn how the input at step 1 (\(x_1\)) contributed to the error at step 5 (\(y_5\)).
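To make the backward walk concrete, here is a minimal sketch of BPTT for a scalar RNN with no bias, with the error measured only at the last step. The squared-error loss, the function names, and all numeric values are assumptions for illustration. The hand-written gradient is then compared against a finite-difference estimate as a sanity check.

```python
import math

def forward(xs, w_h, w_x, w_y):
    """Run a scalar RNN; return hidden states h_0..h_T and the final output."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
    return hs, w_y * hs[-1]

def loss(xs, target, w_h, w_x, w_y):
    _, y = forward(xs, w_h, w_x, w_y)
    return 0.5 * (y - target) ** 2

def bptt_grad_w_h(xs, target, w_h, w_x, w_y):
    """Gradient of the loss w.r.t. w_h, accumulated backwards over all steps."""
    hs, y = forward(xs, w_h, w_x, w_y)
    dh = (y - target) * w_y               # dLoss/dh_T
    grad = 0.0
    for t in range(len(xs), 0, -1):       # walk back in time: T, T-1, ..., 1
        da = dh * (1.0 - hs[t] ** 2)      # through tanh: derivative is 1 - h_t^2
        grad += da * hs[t - 1]            # step t's contribution to the shared weight
        dh = da * w_h                     # pass the error on to h_{t-1}
    return grad

xs, target = [1.0, 0.5, -0.3, 0.8, 0.1], 0.5
w_h, w_x, w_y = 0.7, 1.2, 1.5
analytic = bptt_grad_w_h(xs, target, w_h, w_x, w_y)

# Finite-difference check: nudge w_h slightly and see how the loss changes.
eps = 1e-6
numeric = (loss(xs, target, w_h + eps, w_x, w_y)
           - loss(xs, target, w_h - eps, w_x, w_y)) / (2 * eps)
```

Because `w_h` is shared across steps, its gradient is a sum of one contribution per time step, which is why the loop has to visit every step.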

4. The Fundamental Flaw: Vanishing & Exploding Gradients

This is the technical detail that explains why simple RNNs (like the one just described) are rarely used today.

When you backpropagate through time (BPTT), the error gradient is multiplied at every step by the same weight matrix (\(W_{hh}\)), along with the derivative of \(\tanh\), over and over.
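In symbols, differentiating the recurrence \(h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)\) with respect to the previous hidden state gives one factor per step (here \(h_t^2\) is taken element by element, and \(\operatorname{diag}\) places a vector on the diagonal of a matrix):

\[\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\left(1 - h_t^2\right) \cdot W_{hh}\]

Carrying an error from step 5 back to step 1 chains four of these factors together:

\[\frac{\partial h_5}{\partial h_1} = \frac{\partial h_5}{\partial h_4} \cdot \frac{\partial h_4}{\partial h_3} \cdot \frac{\partial h_3}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1}\]

Each factor contains the same \(W_{hh}\), so the product behaves roughly like a power of \(W_{hh}\).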

Vanishing Gradients (The Big Problem):

If \(W_{hh}\) is "small" (roughly speaking, its largest singular value is below 1), every multiplication shrinks the gradient. The \(\tanh\) derivative is never greater than 1, so it cannot compensate, and the gradient shrinks exponentially with the number of steps. Over a long sequence, by the time the gradient gets back to the earliest steps, it's practically zero.
- Consequence: The network is effectively unable to learn connections between distant words. It can't learn "long-range dependencies." It might learn that "San" is followed by "Francisco," but it can't learn that "I grew up in France... therefore I speak fluent French." The gap is too long.
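A small numerical experiment shows how fast the shrinkage happens. This sketch uses a scalar RNN with a recurrent weight of 0.5; the function name, the constant input, and the step counts are assumptions for illustration.

```python
import math

def gradient_factor_product(w_h, steps, x=0.5, w_x=1.0):
    """Product of d h_t / d h_{t-1} over `steps` steps of a scalar RNN."""
    h, product = 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h + w_x * x)
        product *= w_h * (1.0 - h ** 2)   # one backprop factor per time step
    return product

for steps in (1, 5, 20, 50):
    print(steps, gradient_factor_product(0.5, steps))
```

Each factor is at most 0.5 here, so after 20 steps the product is already below one in a million: the early inputs receive almost no learning signal.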
Exploding Gradients (Easier to solve):

If \(W_{hh}\) is "large" (roughly speaking, its largest singular value is well above 1), the gradient can grow exponentially until it overflows and shows up as inf or NaN. This "explodes" the training, and the network's weights become useless.
- Solution: This is usually fixed with gradient clipping (if the gradient's norm exceeds a threshold such as 5, rescale the gradient so its norm is exactly 5).
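Gradient clipping fits in a few lines. The sketch below clips by the overall norm, a common variant that rescales the whole gradient and so keeps its direction, rather than capping each value separately. The function name `clip_by_norm` is hypothetical, and the threshold of 5 is simply the example value from the text.

```python
import math

def clip_by_norm(grads, max_norm=5.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)                 # small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]      # same direction, shorter length

clipped = clip_by_norm([30.0, -40.0])      # norm 50, rescaled to norm 5
untouched = clip_by_norm([1.0, 2.0])       # norm below 5, returned as is
```

Clipping only limits the size of a single update; it does nothing for vanishing gradients, which is why that problem needs a different fix.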

5. The Solution: Gated Cells (LSTM & GRU)

Because of the vanishing gradient problem, simple RNNs have largely been replaced by more advanced cells that are specifically designed to manage memory over long distances.

You should learn these next:

LSTM (Long Short-Term Memory): This is the most famous one. It doesn't just have one hidden state. It has two:
1. Cell State (\(c_t\)): A "long-term memory" conveyor belt. It's very easy for information to just flow along this belt unchanged.
2. Hidden State (\(h_t\)): A "short-term working memory" based on the cell state.

LSTMs use "gates" (small neural network layers with sigmoid activations) to control the memory. At each step, the LSTM uses its gates to decide:
  - Forget Gate: What parts of the long-term memory to erase.
  - Input Gate: What parts of the new information to add to the long-term memory.
  - Output Gate: What parts of the long-term memory to reveal as the short-term hidden state.
GRU (Gated Recurrent Unit): A simpler, more recent variant of the LSTM. It combines the cell and hidden states into one and uses fewer gates. It often performs about as well and, having fewer parameters, tends to train faster.
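To see the LSTM gates in action, here is a minimal scalar sketch of one LSTM step using the standard update equations. The function names, the parameter layout, and the extreme bias values are assumptions chosen to make the effect obvious; a real LSTM learns its gate parameters and works with vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much old long-term memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the memory to reveal
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (working memory)
    return h, c

# Hypothetical parameters: a forget bias of +10 keeps the forget gate near 1,
# and an input bias of -10 keeps the input gate near 0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.3, -0.8, 0.5] * 10:       # 30 steps of unrelated input
    h, c = lstm_step(x, h, c, params)
```

After 30 steps the cell state `c` is still very close to its starting value of 1.0, which is the "conveyor belt" behavior: with the forget gate open and the input gate closed, the memory rides along almost untouched.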

These gated architectures are the practical answer to the "memory" problem. They create pathways where gradients can flow over long distances with far less shrinkage, allowing them to learn those crucial long-range dependencies.
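For the LSTM, that pathway can be written down directly. Using the standard cell-state update, where \(f_t\) is the forget gate, \(i_t\) the input gate, \(g_t\) the candidate new information (a symbol introduced here for illustration), and \(\odot\) is element-wise multiplication:

\[c_t = f_t \odot c_{t-1} + i_t \odot g_t\]

Along the direct cell-state path (ignoring the indirect route through \(h_{t-1}\) and the gates), the factor picked up at each backward step is just the forget gate:

\[\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)\]

There is no repeated \(W_{hh}\) and no \(\tanh\) derivative on this path. When the network keeps \(f_t\) close to 1, the gradient passes through nearly unchanged, however many steps it has to travel.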