
RNN

Q&A

What is an RNN?

A Recurrent Neural Network (RNN) is a type of artificial neural network designed for processing sequences of data. Unlike traditional neural networks, RNNs have connections that form directed cycles, which allow them to maintain a 'memory' of previous inputs. This makes RNNs particularly well-suited for tasks where context and sequential information are important, such as:

  1. Natural Language Processing (NLP): Tasks like language modeling, translation, and sentiment analysis.
  2. Time Series Prediction: Predicting stock prices, weather forecasting, and other data that is collected over time.
  3. Speech Recognition: Converting spoken language into text.
  4. Video Analysis: Understanding and interpreting sequences of video frames.

Key Features of RNNs

  1. Sequential Data Processing: RNNs process input sequences one element at a time, maintaining an internal state that encodes information about previous elements.
  2. Memory of Past Inputs: The hidden state in RNNs allows them to retain information from past inputs, which is crucial for understanding context in sequences.
  3. Parameter Sharing: The same set of weights is used for all time steps, which makes RNNs efficient and able to generalize across different sequence lengths.

Basic Structure

  • Input Layer: Takes the sequence data as input.
  • Hidden Layer: Processes the input sequence with the recurrent connections, maintaining an internal state that gets updated at each time step.
  • Output Layer: Produces the output at each time step or after the entire sequence has been processed.
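
As a concrete illustration of this structure, here is a minimal PyTorch sketch with an input layer, a recurrent hidden layer, and an output layer. The layer sizes and module names are illustrative assumptions, not something prescribed by the post.

```python
import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    """Input layer -> recurrent hidden layer -> output layer."""
    def __init__(self, input_size=10, hidden_size=32, output_size=4):
        super().__init__()
        # Recurrent hidden layer: maintains an internal state across time steps.
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # Output layer: maps the final hidden state to a prediction.
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x has shape (batch, seq_len, input_size)
        hidden_seq, h_last = self.rnn(x)    # hidden state at every time step, plus the last one
        return self.out(h_last.squeeze(0))  # prediction after the entire sequence is processed

model = SimpleRNNClassifier()
y = model(torch.randn(8, 20, 10))            # a batch of 8 sequences, 20 time steps each
print(y.shape)                               # torch.Size([8, 4])
```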

Types of RNNs

  1. Vanilla RNN: The simplest form, where the current hidden state is a function of the previous hidden state and the current input.
  2. Long Short-Term Memory (LSTM): A variant designed to overcome the vanishing gradient problem, which allows the network to learn long-term dependencies.
  3. Gated Recurrent Unit (GRU): A simpler variant of LSTM with fewer parameters, designed to achieve similar performance with less computational overhead.
  4. Bidirectional RNN: Processes the input sequence in both forward and backward directions, allowing the network to have information from both past and future contexts.
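
All four variants are available as building blocks in common frameworks. The PyTorch sketch below simply shows how each one would be instantiated; the sizes are arbitrary assumptions.

```python
import torch.nn as nn

input_size, hidden_size = 10, 32

vanilla = nn.RNN(input_size, hidden_size)                      # plain tanh recurrence
lstm    = nn.LSTM(input_size, hidden_size)                     # gated cell state for long-term dependencies
gru     = nn.GRU(input_size, hidden_size)                      # gated, with fewer parameters than LSTM
bi_rnn  = nn.RNN(input_size, hidden_size, bidirectional=True)  # processes the sequence forward and backward
```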

Challenges

  • Vanishing/Exploding Gradients: During training, gradients can become very small or very large, making it difficult for the network to learn long-term dependencies.
  • Training Time: RNNs can be computationally intensive to train, especially for long sequences.
  • Complexity: Designing and tuning RNNs can be more complex compared to feedforward networks.

Applications

  • Language Modeling: Predicting the next word in a sentence.
  • Machine Translation: Translating text from one language to another.
  • Speech Synthesis: Generating human-like speech from text.
  • Handwriting Recognition: Recognizing handwritten characters and text.
  • Music Composition: Generating music sequences.

RNNs have been foundational in advancing the state of the art in many sequence-related tasks, though they are increasingly being complemented or replaced by architectures like Transformers, which can handle long-range dependencies more effectively.

What are Vanishing/Exploding Gradients?

What is an LSTM?

How do gradients become small?

In the context of training Recurrent Neural Networks (RNNs) using backpropagation through time (BPTT), the “vanishing gradient problem” refers to the phenomenon where gradients used to update the network’s weights become very small. This issue can hinder the network’s ability to learn long-term dependencies. Here’s a detailed explanation of why this happens:

Understanding Gradients in RNNs

In RNNs, the same set of weights is applied at each time step. During training, gradients of the loss with respect to these weights are calculated and used to update the weights. The gradients are computed through a process called backpropagation through time (BPTT), where the error is propagated backward through the network from the last time step to the first.
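
To make BPTT concrete, here is a minimal, hypothetical PyTorch sketch (the sizes, loss, and data are assumptions): per-step losses are computed over the unrolled sequence, and a single backward call propagates the error from the last time step back to the first through the shared weights.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
head = nn.Linear(32, 10)
criterion = nn.MSELoss()

x = torch.randn(4, 15, 10)         # a batch of 4 sequences, 15 time steps each
targets = torch.randn(4, 15, 10)   # a target at every time step

hidden_states, _ = rnn(x)                        # forward pass through all time steps
loss = criterion(head(hidden_states), targets)   # loss accumulated over the sequence

# Backpropagation through time: this call propagates the error backward
# from the last time step to the first, through the shared weights.
loss.backward()
```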

Mathematical Insight into Vanishing Gradients

  1. Chain Rule and Backpropagation:

    • During BPTT, the gradient of the loss function with respect to the weights involves the product of many gradient terms (one for each time step).
    • Mathematically, the gradient of the loss with respect to the weights is given by: $ \frac{\partial L}{\partial W} = \sum_{t=1}^{T} \left( \frac{\partial L}{\partial h_t} \cdot \frac{\partial h_t}{\partial W} \right) $ where $L$ is the loss, $W$ are the weights, and $h_t$ is the hidden state at time step $t$.
  2. Repeated Multiplication:

    • The term $\frac{\partial h_t}{\partial h_{t-1}}$ involves the weight matrix and the activation function’s derivative, denoted as $f^{'}$.
    • This derivative term is typically less than 1 (especially for activation functions like the sigmoid or tanh).
    • Therefore, the factor that links a later time step $t$ back to the first one behaves roughly like: $ \frac{\partial h_t}{\partial h_1} \approx \prod_{k=1}^{t} f^{'}(h_k) W $
    • When multiplied many times (as in long sequences), these terms $f^{'}(h_k)$ being less than 1 cause the product to shrink exponentially, as illustrated numerically after this list.
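
A quick numeric illustration of this shrinking product (the per-step factors are arbitrary numbers below 1, chosen only to mimic tanh/sigmoid derivatives):

```python
import numpy as np

np.random.seed(0)
T = 50
# Pretend per-step factors f'(h_k) * w, each below 1 in magnitude,
# as is typical when tanh or sigmoid derivatives are involved.
factors = np.random.uniform(0.3, 0.9, size=T)

product = np.cumprod(factors)
for t in [1, 10, 25, 50]:
    print(f"after {t:2d} steps: {product[t - 1]:.2e}")
# The product decays exponentially, so gradient contributions from
# distant time steps become vanishingly small.
```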

Key Points on Vanishing Gradients

  • Activation Functions: Activation functions like sigmoid and tanh squash their inputs to a very small range, and their derivatives are even smaller (between 0 and 1), which exacerbates the vanishing gradient problem.
  • Weight Matrix: If the weight matrix has values less than 1, repeated multiplications further reduce the gradient’s magnitude.

Consequences

  • Poor Learning of Long-Term Dependencies: The network fails to propagate meaningful gradients from later time steps to earlier ones, making it difficult to learn dependencies that span many time steps.
  • Slow Convergence: The learning process becomes very slow as the gradients provide little information for weight updates.

Mitigation Strategies

  1. Use of LSTM and GRU:

    • LSTMs and GRUs are designed with gates that allow gradients to flow more effectively over long sequences. They use mechanisms like the forget gate, input gate, and output gate to maintain and update the cell state selectively.
  2. Gradient Clipping:

    • This technique involves setting a threshold value for gradients. If the gradient norm exceeds this threshold, the gradient is scaled down so that it cannot grow arbitrarily large. Note that clipping primarily addresses exploding gradients; it does not by itself fix vanishing gradients.
  3. Weight Initialization:

    • Proper initialization of weights can help mitigate the problem. Techniques like Xavier or He initialization set the initial weights in a way that maintains the variance of activations and gradients through layers.
  4. Alternative Architectures:

    • Architectures like Transformers, which do not rely on recurrent connections and instead use self-attention mechanisms, can handle long-range dependencies more effectively without suffering from vanishing gradients.
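
As a concrete example of the second strategy, gradient clipping is typically a one-line addition to the training step. This is a minimal sketch with an assumed model, data, and threshold, not a complete training loop.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 25, 10)
output, _ = model(x)
loss = output.pow(2).mean()        # placeholder loss, for illustration only

optimizer.zero_grad()
loss.backward()
# Rescale the gradient norm if it exceeds the threshold (here 5.0),
# which guards against exploding gradients during BPTT.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```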

By understanding and addressing the vanishing gradient problem, RNNs and their variants can be effectively trained to capture long-term dependencies in sequential data.

RNN Notes

Table of Contents:

  • Introduction to RNN
  • RNN example as Character-level language model
  • Multilayer RNNs
  • Long Short-Term Memory (LSTM)

Introduction to RNN

In this lecture note, we’re going to be talking about Recurrent Neural Networks (RNNs). One great thing about RNNs is that they offer a lot of flexibility in how we wire up the neural network architecture. Normally when we’re working with neural networks (Figure 1), we are given a fixed-size input vector (red), we process it with some hidden layers (green), and we produce a fixed-size output vector (blue), as depicted in the leftmost model (“Vanilla” Neural Networks) in Figure 1. While “Vanilla” Neural Networks receive a single input and produce a single label for that input, there are tasks where the model produces a sequence of outputs, as shown in the one-to-many model in Figure 1. Recurrent Neural Networks allow us to operate over sequences of inputs, outputs, or both at the same time.

  • An example of one-to-many model is image captioning where we are given a fixed sized image and produce a sequence of words that describe the content of that image through RNN (second model in Figure 1).
  • An example of many-to-one task is action prediction where we look at a sequence of video frames instead of a single image and produce a label of what action was happening in the video as shown in the third model in Figure 1. Another example of many-to-one task is sentiment classification in NLP where we are given a sequence of words of a sentence and then classify what sentiment (e.g. positive or negative) that sentence is.
  • An example of many-to-many task is video captioning, where the input is a sequence of video frames and the output is a caption that describes what was in the video, as shown in the fourth model in Figure 1. Another example of many-to-many task is machine translation in NLP, where we can have an RNN that takes a sequence of words of a sentence in English, and then this RNN is asked to produce a sequence of words of a sentence in French.
  • There is also a variation of the many-to-many task, as shown in the last model in Figure 1, where the model generates an output at every timestep. An example of this many-to-many task is video classification on a frame level, where the model classifies every single frame of the video with some number of classes. We should note that we don’t want this prediction to be a function of only the current timestep (the current frame of the video), but also of all the timesteps (frames) that have come before it in the video.

In general, RNNs allow us to wire up an architecture, where the prediction at every single timestep is a function of all the timesteps that have come before.

Why are existing convnets insufficient?

Existing convnets are insufficient for tasks whose inputs and outputs have variable sequence lengths. In the example of video captioning, inputs have a variable number of frames (e.g. a 10-minute vs. a 10-hour-long video) and outputs are captions of variable length. Convnets can only take in inputs with a fixed width and height and cannot generalize over inputs of different sizes. In order to tackle this problem, we introduce Recurrent Neural Networks (RNNs).

Recurrent Neural Network

RNN is basically a blackbox (Left of Figure 2), where it has an “internal state” that is updated as a sequence is processed. At every single timestep, we feed in an input vector into RNN where it modifies that state as a function of what it receives. When we tune RNN weights, RNN will show different behaviors in terms of how its state evolves as it receives these inputs. We are also interested in producing an output based on the RNN state, so we can produce these output vectors on top of the RNN (as depicted in Figure 2).

If we unroll an RNN model (Right of Figure 2), then there are inputs (e.g. video frames) at different timesteps shown as $x_1, x_2, x_3$ … $x_t$. At each timestep, the RNN takes in two inputs – an input frame ($x_i$) and a previous representation of what it has seen so far (i.e. its history) – to generate an output $y_i$ and update its history, which gets forward propagated over time. All the RNN blocks in Figure 2 (Right) are the same block sharing the same parameters, but they have different inputs and history at each timestep.

More precisely, RNN can be represented as a recurrence formula of some function $f_W$ with parameters $W$:

\[h_t = f_W(h_{t-1}, x_t)\]

where at every timestep it receives the previous state vector $h_{t-1}$ from timestep $t-1$ and the current input vector $x_t$ to produce the current state vector $h_t$. A fixed function $f_W$ with weights $W$ is applied at every single timestep, and that allows us to use the Recurrent Neural Network on sequences without having to commit to the size of the sequence, because we apply the exact same function at every single timestep, no matter how long the input or output sequences are.

In the simplest form of RNN, which we call a Vanilla RNN, the network has just a single hidden state $h$, and we use a recurrence formula that basically tells us how we should update the hidden state $h$ as a function of the previous hidden state $h_{t-1}$ and the current input $x_t$. In particular, we’re going to have weight matrices $W_{hh}$ and $W_{xh}$, which project both the hidden state $h_{t-1}$ from the previous timestep and the current input $x_t$; those projections are then summed and squashed with the $tanh$ function to update the hidden state $h_t$ at timestep $t$. This recurrence is telling us how $h$ will change as a function of its history and also the current input at this timestep:

\[h_t = tanh(W_{hh}h_{t-1} + W_{xh}x_t)\]

We can base predictions on top of $h_t$ by using just another matrix projection on top of the hidden state. This is the simplest complete case in which you can wire up a neural network:

\[y_t = W_{hy}h_t\]
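
These two equations translate directly into code. A minimal NumPy sketch of a single Vanilla RNN step could look like the following; the dimensions and initialization are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 4

# Shared parameters, reused at every single timestep.
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1

def rnn_step(h_prev, x_t):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t),  y_t = W_hy h_t"""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # a sequence of 5 input vectors
    h, y = rnn_step(h, x_t)
```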

So far we have shown the RNN in terms of abstract vectors $x, h, y$; in the following section we endow these vectors with semantics.

RNN example as Character-level language model

One of the simplest ways in which we can use an RNN is the case of a character-level language model, since it’s intuitive to understand. The way this RNN will work is that we will feed a sequence of characters into the RNN, and at every single timestep we will ask the RNN to predict the next character in the sequence. The prediction of the RNN will be in the form of a score distribution over the characters in the vocabulary for what the RNN thinks should come next in the sequence that it has seen so far.

So suppose, in a very simple example (Figure 3), we have the training sequence of just one string $\text{“hello”}$, and we have a vocabulary $V = \{\text{“h”}, \text{“e”}, \text{“l”}, \text{“o”}\}$ of 4 characters in the entire dataset. We are going to try to get an RNN to learn to predict the next character in the sequence on this training data.

As shown in Figure 3, we’ll feed in one character at a time into an RNN, first $\text{“h”}$, then $\text{“e”}$, then $\text{“l”}$, and finally $\text{“l”}$. All characters are encoded in the representation of what’s called a one-hot vector, where only one unique bit of the vector is turned on for each unique character in the vocabulary.

\[\begin{bmatrix}1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \text{"h"}\ \begin{bmatrix}0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \text{"e"}\ \ \begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix} = \text{"l"}\ \ \begin{bmatrix}0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \text{"o"}\]

Then we’re going to use the recurrence formula from the previous section at every single timestep. Suppose we start off with $h$ as a vector of size 3 with all zeros. By using this fixed recurrence formula, we’re going to end up with a 3-dimensional representation of the next hidden state $h$ that basically at any point in time summarizes all the characters that have come until then:

\[\begin{aligned} \begin{bmatrix}0.3 \\ -0.1 \\ 0.9 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}0 \\ 0 \\ 0 \end{bmatrix} + W_{xh}\begin{bmatrix}1 \\ 0 \\ 0 \\ 0 \end{bmatrix}) \ \ \ \ &(1) \\ \begin{bmatrix}1.0 \\ 0.3 \\ 0.1 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}0.3 \\ -0.1 \\ 0.9 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 1 \\ 0 \\ 0 \end{bmatrix}) \ \ \ \ &(2) \\ \begin{bmatrix}0.1 \\ -0.5 \\ -0.3 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}1.0 \\ 0.3 \\ 0.1 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix}) \ \ \ \ &(3) \\ \begin{bmatrix}-0.3 \\ 0.9 \\ 0.7 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}0.1 \\ -0.5 \\ -0.3 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix}) \ \ \ \ &(4) \end{aligned}\]

As we apply this recurrence at every timestep, we’re going to predict what should be the next character in the sequence at every timestep. Since we have four characters in vocabulary $V$, we’re going to predict a 4-dimensional vector of logits at every single timestep.

As shown in Figure 3, in the very first timestep we fed in $\text{“h”}$, and the RNN with its current setting of weights computed a vector of logits:

\[\begin{bmatrix}1.0 \\ 2.2 \\ -3.0 \\ 4.1 \end{bmatrix} \rightarrow \begin{bmatrix}\text{"h"} \\ \text{"e"} \\ \text{"l"}\\ \text{"o"} \end{bmatrix}\]

where the RNN gives a score of $1.0$ to $\text{“h”}$ coming next, $2.2$ to $\text{“e”}$, $-3.0$ to $\text{“l”}$, and $4.1$ to $\text{“o”}$. In this case, the RNN incorrectly suggests that $\text{“o”}$ should come next, as the score of $4.1$ is the highest. However, of course, we know that in this training sequence $\text{“e”}$ should follow $\text{“h”}$, so in fact the score of $2.2$ is the correct answer, as it’s highlighted in green in Figure 3, and we want that score to be high and all other scores to be low. At every single timestep we have a target for what the next character should be in the sequence, therefore the error signal is backpropagated as a gradient of the loss function through the connections. As a loss function we could choose a softmax classifier, for example, so we just get all those losses flowing down from the top backwards to calculate the gradients on all the weight matrices, to figure out how to shift the matrices so that the correct probabilities are coming out of the RNN. Similarly, we can imagine how to scale up the training of the model over a larger training dataset.
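
The sketch below (hypothetical, PyTorch-based) shows the data side of this setup for the “hello” example: one-hot inputs, next-character targets, and a softmax cross-entropy loss applied at every time step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

text = "hello"
inputs  = torch.tensor([char_to_idx[c] for c in text[:-1]])  # "h", "e", "l", "l"
targets = torch.tensor([char_to_idx[c] for c in text[1:]])   # "e", "l", "l", "o"
x = F.one_hot(inputs, num_classes=4).float().unsqueeze(0)    # (1 batch, 4 steps, 4-dim one-hot)

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)  # 3-dimensional hidden state
to_logits = nn.Linear(3, 4)                                  # logits over the 4-character vocabulary

hidden_seq, _ = rnn(x)
logits = to_logits(hidden_seq).squeeze(0)                    # (4 time steps, 4 characters)

# Softmax cross-entropy at every time step against the true next character.
loss = F.cross_entropy(logits, targets)
loss.backward()
```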

Multilayer RNNs

So far we have only shown RNNs with just one layer. However, we’re not limited to single-layer architectures. One of the ways RNNs are used today is in a more complex, multilayer manner. RNNs can be stacked together in multiple layers, which gives more depth, and empirically deeper architectures tend to work better (Figure 4).

For example, in Figure 4, there are three separate RNNs, each with their own set of weights. The three RNNs are stacked on top of each other, so the input of the second RNN (second RNN layer in Figure 4) is the hidden state vector of the first RNN (first RNN layer in Figure 4). All stacked RNNs are trained jointly, and the diagram in Figure 4 represents one computational graph.
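
In code, stacking is often just a parameter. A minimal sketch (sizes are assumptions) of three RNN layers, where each layer’s hidden state sequence becomes the next layer’s input sequence:

```python
import torch
import torch.nn as nn

# Three RNN layers stacked on top of each other; the hidden states of layer k
# are fed as the inputs of layer k+1, and all layers are trained jointly
# as one computational graph.
stacked_rnn = nn.RNN(input_size=10, hidden_size=32, num_layers=3, batch_first=True)

x = torch.randn(8, 20, 10)        # a batch of 8 sequences, 20 time steps each
hidden_seq, h_n = stacked_rnn(x)
print(h_n.shape)                   # torch.Size([3, 8, 32]): one final hidden state per layer
```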

Long Short-Term Memory (LSTM)

So far we have seen only a simple recurrence formula for the Vanilla RNN. In practice, we will actually rarely ever use the Vanilla RNN formula. Instead, we will use what we call a Long Short-Term Memory (LSTM) RNN.

Vanilla RNN Gradient Flow & Vanishing Gradient Problem

An RNN block takes in the input $x_t$ and the previous hidden representation $h_{t-1}$ and learns a transformation, which is then passed through tanh to produce the hidden representation $h_{t}$ for the next time step and the output $y_{t}$, as shown in the equation below.

\[h_t = tanh(W_{hh}h_{t-1} + W_{xh}x_t)\]

For backpropagation, let’s examine how the output at the very last timestep affects the weights at the very first timestep. The partial derivative of $h_t$ with respect to $h_{t-1}$ is written as: \(\frac{\partial h_t}{\partial h_{t-1}} = tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)W_{hh}\)

We update the weights $W_{hh}$ by getting the derivative of the loss at the very last time step $L_{t}$ with respect to $W_{hh}$.

\[\begin{aligned} \frac{\partial L_{t}}{\partial W_{hh}} &= \frac{\partial L_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t-1}} \dots \frac{\partial h_{1}}{\partial W_{hh}} \\ &= \frac{\partial L_{t}}{\partial h_{t}}\left(\prod_{k=2}^{t} \frac{\partial h_{k}}{\partial h_{k-1}}\right)\frac{\partial h_{1}}{\partial W_{hh}} \\ &= \frac{\partial L_{t}}{\partial h_{t}}\left(\prod_{k=2}^{t} tanh^{'}(W_{hh}h_{k-1} + W_{xh}x_k)\right)W_{hh}^{t-1}\frac{\partial h_{1}}{\partial W_{hh}} \\ \end{aligned}\]
  • Vanishing gradient: We see that $tanh^{'}(W_{hh}h_{k-1} + W_{xh}x_k)$ will almost always be less than 1, because tanh is always between negative one and one. Thus, as $t$ gets larger (i.e. longer timesteps), the gradient $\frac{\partial L_{t}}{\partial W_{hh}}$ will decrease in value and get close to zero. This leads to the vanishing gradient problem, where gradients at future time steps rarely impact gradients at the very first time step. This is problematic when we model long sequences of inputs because the updates will be extremely slow.

  • Removing the non-linearity (tanh): If we remove the non-linearity (tanh) to solve the vanishing gradient problem, then we will be left with \(\begin{aligned} \frac{\partial L_{t}}{\partial W} = \frac{\partial L_{t}}{\partial h_{t}} W_{hh}^{t-1}\frac{\partial h_{1}}{\partial W} \end{aligned}\)
    • Exploding gradients: If the largest singular value of $W_{hh}$ is greater than 1, then the gradients will blow up and the model will get very large gradients coming back from future time steps. Exploding gradients often lead to gradients that are NaNs.
    • Vanishing gradients: If the largest singular value of $W_{hh}$ is smaller than 1, then we will have the vanishing gradient problem as mentioned above, which will significantly slow down learning.

In practice, we can treat the exploding gradient problem with gradient clipping, which clips large gradient values to a maximum threshold. However, since the vanishing gradient problem still exists in cases where the largest singular value of the $W_{hh}$ matrix is less than one, the LSTM was designed to avoid this problem.
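
The role of the largest singular value can be seen with a tiny NumPy experiment; the matrices below are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def largest_singular_value(W):
    return np.linalg.svd(W, compute_uv=False)[0]

def norm_after_t_steps(W, t=50):
    # ||W^t v|| for a random vector v, mimicking the repeated W_hh factor
    # that appears in the unrolled gradient.
    v = rng.standard_normal(W.shape[0])
    for _ in range(t):
        v = W @ v
    return np.linalg.norm(v)

W_small = 0.5 * np.eye(3)   # largest singular value 0.5 -> the product vanishes
W_large = 1.5 * np.eye(3)   # largest singular value 1.5 -> the product explodes

for W in (W_small, W_large):
    print(largest_singular_value(W), norm_after_t_steps(W))
```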

LSTM Formulation

The following is the precise formulation for LSTM. At step $t$, there is a hidden state $h_t$ and a cell state $c_t$. Both $h_t$ and $c_t$ are vectors of size $n$. One distinction of the LSTM from the Vanilla RNN is that the LSTM has this additional cell state $c_t$; intuitively, $c_t$ can be thought of as storing long-term information. The LSTM can read, erase, and write information to and from this $c_t$ cell. The way the LSTM alters the $c_t$ cell is through three special gates: $i, f, o$, which correspond to the “input”, “forget”, and “output” gates. The values of these gates vary from closed (0) to open (1). All $i, f, o$ gates are vectors of size $n$.

At every timestep we have an input vector $x_t$, previous hidden state $h_{t-1}$, previous cell state $c_{t-1}$, and LSTM computes the next hidden state $h_t$, and next cell state $c_t$ at timestep $t$ as follows:

\[\begin{aligned} f_t &= \sigma(W_{hf}h_{t-1} + W_{xf}x_t) \\ i_t &= \sigma(W_{hi}h_{t-1} + W_{xi}x_t) \\ o_t &= \sigma(W_{ho}h_{t-1} + W_{xo}x_t) \\ g_t &= \text{tanh}(W_{hg}h_{t-1} + W_{xg}x_t) \\ \end{aligned}\] \[\begin{aligned} c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ h_t &= o_t \odot \text{tanh}(c_t) \\ \end{aligned}\]

where $\odot$ is the element-wise Hadamard product. $g_t$ in the above formulas is an intermediate candidate value that is later combined with the input gate $i_t$ to update the cell state $c_t$.
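
A direct NumPy transcription of these equations for a single LSTM step is sketched below; the weight shapes and initialization are illustrative assumptions, and bias terms are omitted to match the formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step following the gate equations above (no bias terms)."""
    f_t = sigmoid(W["hf"] @ h_prev + W["xf"] @ x_t)   # forget gate
    i_t = sigmoid(W["hi"] @ h_prev + W["xi"] @ x_t)   # input gate
    o_t = sigmoid(W["ho"] @ h_prev + W["xo"] @ x_t)   # output gate
    g_t = np.tanh(W["hg"] @ h_prev + W["xg"] @ x_t)   # candidate cell update
    c_t = f_t * c_prev + i_t * g_t                    # element-wise (Hadamard) products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

n, d = 3, 4                                           # hidden/cell size n, input size d
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n, n if k[0] == "h" else d)) * 0.1
     for k in ["hf", "hi", "ho", "hg", "xf", "xi", "xo", "xg"]}

h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.standard_normal(d), h, c, W)
```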

Since all $f, i, o$ gate vector values range from 0 to 1 (they are squashed by the sigmoid function $\sigma$), when they are multiplied element-wise we can see that:

  • Forget Gate: The forget gate $f_t$ at time step $t$ controls how much information needs to be “removed” from the previous cell state $c_{t-1}$. This forget gate learns to erase hidden representations from previous time steps, which is why the LSTM keeps two representations: the hidden state $h_t$ and the cell state $c_t$. This $c_t$ gets propagated over time, and the network learns whether to forget the previous cell state or not.
  • Input Gate: The input gate $i_t$ at time step $t$ controls how much information needs to be “added” to the next cell state $c_t$ from the previous hidden state $h_{t-1}$ and the input $x_t$. Instead of tanh, the “input” gate $i$ uses a sigmoid function, which converts its inputs to values between zero and one. This serves as a switch, where values are either almost always zero or almost always one. This “input” gate decides how much of the candidate value produced by the “gate” gate $g_t$ enters the cell state, by multiplying $g_t$ element-wise with $i_t$.
  • Output Gate: Output gate $o_t$ at time step $t$ controls how much information needs to be “shown” as output in the current hidden state $h_t$.

The key idea of LSTM is the cell state, the horizontal line running between recurrent timesteps. You can imagine the cell state as a kind of highway of information passing straight down the entire chain, with only some minor linear interactions. With the formulation above, it’s easy for information to just flow along this highway (Figure 5). Thus, even when a bunch of LSTM cells are chained together, we can get an uninterrupted gradient flow where the gradients flow back through the cell states instead of the hidden states $h$, without vanishing at every time step.

This greatly mitigates the gradient vanishing/exploding problem we have outlined above. Figure 5 also shows that the gradient contains a vector of activations of the “forget” gate. This allows better control of gradient values through suitable parameter updates of the “forget” gate.

Does LSTM solve the vanishing gradient problem?

The LSTM architecture makes it easier for the RNN to preserve information over many recurrent time steps. For example, if the forget gate is set to 1 and the input gate is set to 0, then the information of the cell state will always be preserved over many recurrent time steps. For a Vanilla RNN, in contrast, it’s much harder to preserve information in the hidden state across recurrent time steps by just making use of a single weight matrix.

LSTMs do not guarantee that there are no vanishing/exploding gradients, but they do provide an easier way for the model to learn long-distance dependencies.

This post is licensed under CC BY 4.0 by the author.