Gated Recurrent Units (GRU)

As mentioned above, RNN suffers from vanishing/exploding gradients and can´t remember states for very long. GRU - Cho, 2014, is an application of multiplicative modules that attempts to solve these problems. It's an example of recurrent net with memory (another is LSTM). The structure of a GRU is shown below: ![[Pasted image 20240115232829.png|600]] $z_t = \sigma_g(\mathbf{W}_z \mathbf{x}_t + \mathbf{U}_z \mathbf{h}_{t-1} + \mathbf{b}_z)$ $r_t = \sigma_g(\mathbf{W}_r \mathbf{x}_t + \mathbf{U}_r \mathbf{h}_{t-1} + \mathbf{b}_r)$ $\mathbf{h}_t = \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \phi_h(\mathbf{W}_h \mathbf{x}_t + \mathbf{U}_h (\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h)$ where $\odot$ denotes element-wise multiplication (Hadamard product), $x_t$ is the input vector, $h_t$ is the output vector, $z_t$ is the update gate vector, $r_t$ is the reset gate vector, $\phi_h$ is a hyperbolic tanh, and **W, U, b** are learnable parameters. To be specific, $z_t$ is a gating vector that determines how much of the past information should be passed along to the future. It applies a sigmoid function to the sum of two linear layers and a bias over the input $x_t$ and the previous state $h_{t-1}$. $z_t$ contains coefficients between 0 and 1 as a result of applying sigmoid. The final output state $h_t$ is a convex combination of $h_{t-1}$ and $\phi_h(\mathbf{W}_h \mathbf{x}_t + \mathbf{U}_h (\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h)$ via $z_t$. If the coefficient is 1, the current unit output is just a copy of the previous state and ignores the input (which is the default behaviour). If it is less than one, then it takes into account some new information from the input. The reset gate $r_t$ is used to decide how much of the past information to forget. In the new memory content $\phi_h(\mathbf{W}_h \mathbf{x}_t + \mathbf{U}_h (\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h)$, if the coefficient in $r_t$ is 0, then it stores none of the information from the past. If at the same time $z_t$ is 0, then the system is completely reset since $h_t$ would only look at the input.