Handling CUDA Out-of-Memory Errors in PyTorch
Jiguang Li
Center for Applied Artificial Intelligence
May 6th, 2022
Motivation

- Suppose we want to build a large CNN model that can predict multiple outcomes from X-ray images.
- The average X-ray image has a very high resolution (2000 × 2000).
- We run into a CUDA out-of-memory error with a batch size as small as 8, even after downsampling the images to 512 × 512.

Question: What are some general approaches to avoid downsampling, or to use a larger batch size?

- Data Parallelization
- Model-Based Parallelization
- Gradient Checkpointing
- ...
Our Multi-Head CNN Model

- Multiple medical imaging studies have shown that the DenseNet architecture works well for X-ray images [2, 3].
- The dense blocks require a lot of GPU memory.
- We have made our model flexible so that we can iterate faster. A sketch of such a multi-head model follows below.
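As a rough illustration (not the actual model from this talk), a multi-head classifier on a torchvision DenseNet-169 backbone might look like the sketch below; the head count, the head output sizes, and the weights=None argument (torchvision >= 0.13) are assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models

    class MultiHeadDenseNet(nn.Module):
        """Hypothetical multi-head classifier on a DenseNet-169 backbone."""
        def __init__(self, head_sizes=(2, 2, 5)):            # head sizes are purely illustrative
            super().__init__()
            backbone = models.densenet169(weights=None)       # assumes torchvision >= 0.13
            self.features = backbone.features                 # the memory-hungry dense blocks
            in_feats = backbone.classifier.in_features        # 1664 for DenseNet-169
            # one linear head per outcome we want to predict
            self.heads = nn.ModuleList([nn.Linear(in_feats, n) for n in head_sizes])

        def forward(self, x):
            h = self.features(x)
            h = nn.functional.relu(h, inplace=True)
            h = nn.functional.adaptive_avg_pool2d(h, 1).flatten(1)
            return [head(h) for head in self.heads]           # one logit tensor per outcome

    model = MultiHeadDenseNet().cuda()
    outputs = model(torch.randn(2, 3, 512, 512, device="cuda"))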
Approach 1: Data Parallelization: nn.DataParallel

- Replicate a copy of the model on each GPU.
- Split the mini-batch across all GPUs.
- Forward: each replica handles a portion of the input.
- Backward: gradients from each replica are summed into the original module. A usage sketch follows below.
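A minimal usage sketch, assuming at least two visible GPUs and a plain torchvision DenseNet-169 standing in for our model; the loss here is only a placeholder to drive the backward pass.

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.densenet169(weights=None).cuda()          # parameters live on cuda:0 (the "original module")
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)                       # replicated onto all visible GPUs on each forward

    images = torch.randn(32, 3, 512, 512, device="cuda")     # the full mini-batch starts on cuda:0
    logits = model(images)                                   # mini-batch is split across GPUs, outputs gathered on cuda:0
    loss = logits.mean()                                     # placeholder loss
    loss.backward()                                          # per-replica gradients are summed into the original module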
Data Parallelization: The Good, the Bad, and the Ugly

The Good
- Very easy to implement.
- Fast: takes advantage of multiple GPUs.
- Saves memory: each GPU only receives a smaller number of images.

The Bad
- What if the model itself is too large to fit on one GPU?
- Potentially unstable training, because batch normalization statistics are computed on the smaller per-GPU batches.
Approach 2: Model-Based Parallelization

- Evenly distribute a single model across multiple GPUs.
- During the forward pass, each GPU is responsible for only one component of the computation.
DenseNet169 Model-based Parallelization
Figure: DenseNet169 Model-Based Parallelization Implementation
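The figure itself is not reproduced here. As a hedged sketch of what such a split can look like, the code below cuts a torchvision DenseNet-169 feature extractor roughly in half across cuda:0 and cuda:1; the split point and the 14-class head are assumptions, not the implementation shown in the figure.

    import torch
    import torch.nn as nn
    from torchvision import models

    class TwoGPUDenseNet(nn.Module):
        """Sketch: DenseNet-169 features split across cuda:0 and cuda:1."""
        def __init__(self, num_classes=14):                       # class count is illustrative
            super().__init__()
            backbone = models.densenet169(weights=None)
            blocks = list(backbone.features.children())            # conv stem, dense blocks, transitions, final norm
            self.part1 = nn.Sequential(*blocks[:8]).to("cuda:0")   # first half of the feature extractor
            self.part2 = nn.Sequential(*blocks[8:]).to("cuda:1")   # second half
            self.classifier = nn.Linear(backbone.classifier.in_features, num_classes).to("cuda:1")

        def forward(self, x):
            h = self.part1(x.to("cuda:0"))
            h = self.part2(h.to("cuda:1"))                         # move activations between GPUs
            h = nn.functional.adaptive_avg_pool2d(nn.functional.relu(h), 1).flatten(1)
            return self.classifier(h)

    model = TwoGPUDenseNet()
    out = model(torch.randn(4, 3, 512, 512))   # backward() works unchanged; autograd routes gradients back across devices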
Model-Based Parallelization: The Good, the Bad, and the Ugly

Almost the inverse of data parallelization.

The Good
- Stable training: every layer sees the full batch of images.
- Saves memory: especially useful when the model is too large for a single GPU. Note that only one GPU is working at any given time.

The Bad
- We have to implement the split ourselves, from scratch.
- Very slow: often slower than using a single GPU, since each GPU sits idle while waiting for the others.
Model-Based Parallelization: Faster Version
Figure: DenseNet169 Model-Based Parallelization + Data Parallel
Model-Based + Data Parallelization: Implementation
I have done a truly remarkable implementation which this margin is
too small to contain.
Figure: DenseNet169 Model-Based Parallelization + Data Parallel
Approach 3: Gradient Checkpointing

Intuition
- The total memory used by a neural network is the static memory used by the model weights plus the dynamic memory used by the computational graph (activations).
- During the forward pass, gradient checkpointing omits part of the activation values from the computational graph.
- During backpropagation, the omitted forward computations are recomputed on demand.
- One can show that gradient checkpointing can train an n-layer network with only O(√n) activation memory, at the cost of one extra forward pass per mini-batch [1].
Gradient Checkpointing: Implementation
Figure: DenseNet169 Gradient Checkpointing
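Again, the figure is not reproduced here. Below is a minimal sketch of the same idea using torch.utils.checkpoint.checkpoint_sequential on a torchvision DenseNet-169 feature extractor; the number of segments and the 14-class head are assumptions.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint_sequential
    from torchvision import models

    class CheckpointedDenseNet(nn.Module):
        """Sketch: DenseNet-169 with a checkpointed feature extractor."""
        def __init__(self, num_classes=14, segments=4):          # segment count is illustrative
            super().__init__()
            backbone = models.densenet169(weights=None)
            self.features = backbone.features                    # an nn.Sequential of stem, dense blocks, transitions
            self.segments = segments
            self.classifier = nn.Linear(backbone.classifier.in_features, num_classes)

        def forward(self, x):
            # Only activations at segment boundaries are stored; everything in
            # between is recomputed during the backward pass.
            h = checkpoint_sequential(self.features, self.segments, x)
            h = nn.functional.adaptive_avg_pool2d(nn.functional.relu(h), 1).flatten(1)
            return self.classifier(h)

    model = CheckpointedDenseNet().cuda()
    # requires_grad on the input lets gradients flow through the checkpointed
    # segments with the default (reentrant) checkpoint implementation.
    x = torch.randn(8, 3, 512, 512, device="cuda", requires_grad=True)
    model(x).mean().backward()                                   # the extra forward passes happen here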
Gradient Checkpointing: The Good, the Bad, and the Ugly

The Good
- Relatively easy to implement.
- Stable training.
- Saves memory: parts of the activation values are never stored.
- One GPU is all you need.
- You can combine nn.DataParallel and gradient checkpointing.

The Bad
- Slower: one extra forward pass per mini-batch.
References

[1] Chen, Tianqi, et al. "Training Deep Nets with Sublinear Memory Cost." arXiv:1604.06174 (2016).
[2] Irvin, Jeremy A., et al. "CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison." AAAI (2019).
[3] Rajpurkar, Pranav, et al. "MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs." arXiv (2017).