Approach 3: Gradient Checkpointing
Intuition
- The total memory used to train a neural network is the static memory occupied by the model weights plus the dynamic memory occupied by the computational graph, i.e. the activations stored for backpropagation.
- During the forward pass, gradient checkpointing omits part of the activation values from the computational graph, keeping only a sparse set of checkpoints (see the sketch after this list).
- During backpropagation, the omitted activations are recovered on demand by re-running the corresponding part of the forward pass from the nearest checkpoint.
- One can show that gradient checkpointing needs only O(√n) memory to train an n-layer network, at the cost of one extra forward pass per mini-batch [1] (a worked count follows the code sketch below).
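
To make the middle two bullets concrete, here is a minimal sketch of gradient checkpointing using PyTorch's torch.utils.checkpoint.checkpoint_sequential; the layer widths, number of layers, segment count, and batch size are illustrative assumptions rather than values from the source.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative n-layer network (all sizes here are assumptions for the sketch).
n_layers = 16
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU())
                        for _ in range(n_layers)])

x = torch.randn(32, 256, requires_grad=True)

# Forward pass: split the network into segments and keep only the segment
# inputs (the checkpoints); activations inside each segment are discarded.
segments = 4  # roughly sqrt(n_layers)
out = checkpoint_sequential(model, segments, x)

# Backward pass: each segment's forward computation is re-run on demand from
# its checkpoint, trading one extra forward pass per mini-batch for memory.
loss = out.sum()
loss.backward()

Under this scheme, only the checkpointed segment inputs plus the activations of the segment currently being recomputed need to be alive at any point during the backward pass.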
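As a rough count behind the O(√n) claim, assuming every layer produces an activation of comparable size: splitting the n layers into k segments and checkpointing only the k segment inputs keeps about k + n/k activations in memory during backpropagation. This is minimized at k = √n, which matches the O(√n) peak activation memory cited above, paid for by recomputing each segment's forward pass once per mini-batch.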