Neural network debugging is too difficult, so here are six practical tips

The bottleneck of a neural-network project is usually not the implementation of the network itself, but the debugging that follows.

First, verify that the gradients are computed correctly. This is often referred to as "gradient checking": approximate the gradient numerically with finite differences, compare it with the analytical gradient from backpropagation, and make sure the ratio between the two (their relative error) is reasonably small; a minimal sketch is given below.

Exploding and vanishing gradients are another common failure mode. Once the cause has been identified, there are several ways to address it, such as adding residual connections so that the gradient propagates more easily, or simply using a smaller network. The activation function can also make the gradient explode or vanish: if the input to a sigmoid is very large, its gradient is very close to zero. Sketches of both points follow the gradient-checking example.

Second, the network's output may contain error patterns that no quantitative metric will reveal, so it pays to look at the predictions themselves. If a smaller network succeeds where the full-size network fails, that is a sign that the full-size architecture is too complex for the problem. Finally, if something goes wrong during backpropagation, print the gradients of the weights layer by layer, starting from the last layer, until you find where they diverge from what you expect; the last sketch below shows one way to do this.
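Here is a minimal sketch of gradient checking, assuming a toy setup (a single linear map W with a squared-error loss; the shapes, random seed, and epsilon are illustrative choices, not from the original article). It compares the hand-derived analytical gradient with a central finite-difference approximation and reports their relative error, which should be tiny if the analytical gradient is implemented correctly.

```python
import numpy as np

# Toy problem: loss(W) = 0.5 * ||W @ x - y||^2, whose analytical
# gradient is dL/dW = (W @ x - y) @ x.T
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=(4, 1))
y = rng.normal(size=(3, 1))

def loss(W):
    return 0.5 * np.sum((W @ x - y) ** 2)

def analytic_grad(W):
    return (W @ x - y) @ x.T

def numeric_grad(W, eps=1e-6):
    """Central finite differences: perturb one entry of W at a time."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)
    return grad

g_a = analytic_grad(W)
g_n = numeric_grad(W)
# If the analytical gradient is correct, this relative error is tiny
# (far below 1e-6 in double precision).
rel_error = np.linalg.norm(g_a - g_n) / (np.linalg.norm(g_a) + np.linalg.norm(g_n))
print(f"relative error: {rel_error:.2e}")
```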
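As a sketch of the residual-connection remedy, the block below (written with PyTorch; the two-linear-layer body and the width are my own illustrative choices) adds an identity path `x + F(x)`, which gives the gradient a direct route around the block and helps it reach earlier layers in a deep stack.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A plain two-layer block with a skip connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: output = x + F(x). The identity term lets the
        # gradient flow straight through even if the body saturates.
        return x + self.body(x)

# Quick smoke test on random data.
block = ResidualBlock(32)
out = block(torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 32])
```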
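The sigmoid point can be verified numerically: its derivative is sigmoid(z) * (1 - sigmoid(z)), which peaks at 0.25 at z = 0 and shrinks rapidly as |z| grows, so a saturated unit passes almost no gradient backwards. A small demonstration (the input values are chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient collapses toward zero as |z| grows, which is what
# starves earlier layers of learning signal once the unit saturates.
for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.3e}")
```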
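Finally, a sketch of the layer-by-layer gradient inspection, assuming a PyTorch model (the tiny architecture and random batch are placeholders). After one backward pass it walks the parameters from the last layer towards the first and prints each gradient's norm; a sudden jump to inf/NaN or a collapse towards zero points at the layer where things go wrong.

```python
import torch
import torch.nn as nn

# Placeholder model and batch, just to have gradients to inspect.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
x = torch.randn(8, 16)
target = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Iterate from the last layer backwards and print each gradient norm.
for name, param in reversed(list(model.named_parameters())):
    if param.grad is not None:
        print(f"{name:20s} grad norm = {param.grad.norm().item():.3e}")
```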
