Gradient vanishing and exploding are common issues in training deep neural networks. They occur when gradients become too small or too large during backpropagation, which either stalls learning or makes it unstable. Understanding these phenomena involves analyzing how weight updates are computed and exploring techniques that keep gradients in a usable range.
Gradient Vanishing
Gradient vanishing happens when the gradients decrease exponentially as they are propagated backward through layers. This results in very small weight updates, causing the network to learn very slowly or stop learning altogether.
By the chain rule, the gradient of the loss with respect to the weights at layer l is:

∂L/∂w_l = (∂L/∂a_l) * (∂a_l/∂z_l) * (∂z_l/∂w_l)
where ∂a_l/∂z_l is the derivative of the activation function at layer l. If this derivative is less than 1 at every layer, repeated multiplication across many layers causes the gradient to diminish exponentially with depth.
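A quick numeric sketch makes the effect concrete. The sigmoid derivative s'(z) = s(z)(1 − s(z)) peaks at 0.25, so even in the best case a chain of sigmoid layers shrinks the gradient by at least a factor of 4 per layer (the depth of 10 here is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Even at the most favorable input (z = 0), the derivative is only 0.25,
# so a 10-layer chain scales the gradient by at most 0.25 ** 10.
depth = 10
per_layer = sigmoid_deriv(0.0)        # 0.25, the maximum of s'(z)
gradient_scale = per_layer ** depth
print(per_layer)                      # 0.25
print(gradient_scale)                 # ~9.5e-07
```

After only ten layers the gradient reaching the early weights is smaller than 10⁻⁶ of the original signal, which is why deep sigmoid networks learn their first layers extremely slowly.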
Gradient Exploding
Gradient exploding occurs when the gradients grow exponentially during backpropagation. This leads to very large weight updates, which can cause instability and divergence in training.
Mathematically, if the per-layer factors (the activation derivatives multiplied by the weights) are greater than 1, the product of gradients grows exponentially:

∂L/∂w_l ≈ (∂L/∂a_{l+1}) * (∂a_{l+1}/∂z_{l+1}) * (∂z_{l+1}/∂a_l) * …
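The same repeated-multiplication argument runs in reverse. A hypothetical sketch: if each backward step scales the gradient by a factor only slightly above 1 (1.5 here is an illustrative value, as is the depth of 50), the magnitude blows up quickly:

```python
# Hypothetical sketch: each layer multiplies the backpropagated gradient
# by a constant per-layer factor slightly greater than 1.
factor = 1.5
grad = 1.0
for layer in range(50):
    grad *= factor

print(grad)  # ~6.4e8 — the gradient has grown by eight orders of magnitude
```

Weight updates of this size overshoot any reasonable minimum, producing NaNs or wildly oscillating losses, which is the instability and divergence described above.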
Solutions to Vanishing and Exploding Gradients
Several techniques can mitigate these issues:
- Weight Initialization: Using methods like Xavier or He initialization helps maintain stable gradients.
- Activation Functions: ReLU and its variants reduce the risk of vanishing gradients.
- Gradient Clipping: Limiting the maximum value of gradients prevents explosion.
- Normalization: Batch normalization keeps layer activations in a stable range, which steadies the gradients flowing backward.
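Of these, gradient clipping is the simplest to sketch in isolation. Below is a minimal NumPy implementation of clipping by global norm, one common variant; the threshold of 5.0 is an illustrative choice, not a universal default:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm; directions are preserved."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: two parameter groups whose global norm is sqrt(9+16+144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=5.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(clipped_norm)  # 5.0 — scaled down from 13, direction unchanged
```

Because all gradients are rescaled by the same factor, the update direction is unchanged; only the step size is capped, which is what prevents a single exploding batch from destabilizing training.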