Understanding Gradient Vanishing and Exploding: Calculations and Solutions

Gradient vanishing and exploding are common issues in training deep neural networks. They occur when gradients shrink toward zero or grow without bound during backpropagation, stalling or destabilizing learning. Understanding these phenomena involves analyzing the calculations behind weight updates and exploring potential solutions.

Gradient Vanishing

Gradient vanishing happens when the gradients decrease exponentially as they are propagated backward through layers. This results in very small weight updates, causing the network to learn very slowly or stop learning altogether.

Mathematically, by the chain rule, the gradient at layer l can be expressed as:

∂L/∂w_l = (∂L/∂a_l) * (∂a_l/∂z_l) * (∂z_l/∂w_l)

where ∂a_l/∂z_l is the derivative of the activation function at layer l. The factor ∂L/∂a_l itself expands recursively into a product containing one such derivative for every later layer, so if each derivative is less than 1 in magnitude, the repeated multiplication causes the gradient to diminish exponentially with depth.
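The shrinking product above can be sketched numerically. This is a minimal illustration using the sigmoid, whose derivative is at most 0.25, so even in the best case the gradient scale falls by a factor of 4 per layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), peaking at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

# Chain the best-case derivative (at z = 0) across 20 layers.
grad = 1.0
for _ in range(20):
    grad *= sigmoid_grad(0.0)  # multiply by 0.25 each layer

print(grad)  # 0.25**20, roughly 9.1e-13
```

Even under these most favorable conditions the gradient reaching the first layer is about 10^-12 of its original size, which is why sigmoid-heavy deep networks learn so slowly in early layers.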

Gradient Exploding

Gradient exploding occurs when the gradients grow exponentially during backpropagation. This leads to very large weight updates, which can cause instability and divergence in training.

Mathematically, if the factors in the chain-rule product are greater than 1 in magnitude, the gradient grows exponentially with depth:

∂L/∂w_l ≈ (∂L/∂a_{l+1}) * (∂a_{l+1}/∂z_{l+1}) * (∂z_{l+1}/∂a_l) * …
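The growth can be seen by backpropagating a gradient through a stack of linear layers whose weights are scaled slightly too large. This is a hedged sketch (the depth, width, and scale are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 30, 64
scale = 1.5  # weight scale > 1 inflates gradients layer by layer

grad = np.ones(width)
for _ in range(depth):
    # Random weights with per-layer gain ~ scale.
    W = rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
    grad = W.T @ grad  # one backward step through a linear layer

print(np.linalg.norm(grad))  # grows roughly like scale**depth
```

With scale set to 1.0 instead of 1.5, the same loop keeps the gradient norm roughly constant, which is exactly the property careful weight initialization tries to achieve.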

Solutions to Vanishing and Exploding Gradients

Several techniques can mitigate these issues:

  • Weight Initialization: Xavier (for tanh/sigmoid) and He (for ReLU) initialization scale the initial weights so that the variance of activations and gradients stays roughly constant across layers.
  • Activation Functions: ReLU and its variants have a derivative of 1 over their active region, reducing the risk of vanishing gradients.
  • Gradient Clipping: Rescaling gradients whose norm exceeds a threshold prevents exploding updates.
  • Normalization: Batch normalization keeps each layer's inputs in a stable range, which stabilizes both the forward pass and the backpropagated gradients.
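Two of these remedies are simple enough to sketch directly. The snippet below shows He initialization and norm-based gradient clipping; the function names and parameter values are illustrative, not from a specific library:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He initialization: variance 2/fan_in, suited to ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def clip_gradient(grad, max_norm=1.0):
    # Rescale the gradient only if its L2 norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

W = he_init(256, 256)               # weights with stable gradient scale
g = clip_gradient(np.full(256, 10.0), max_norm=1.0)
print(np.linalg.norm(g))            # ≈ 1.0 after clipping
```

Clipping caps the worst-case update size without changing the gradient's direction, while good initialization addresses the problem at its source by keeping the per-layer gain near 1.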