
python - Final step of PyTorch Gradient Accumulation for small datasets

I am training a BERT model on a relatively small dataset and cannot afford to lose any labelled samples, as they must all be used for training. Due to GPU memory constraints, I am using gradient accumulation to train with a larger effective batch size (e.g. 32). According to the PyTorch documentation, gradient accumulation is implemented as follows:

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

However, with e.g. 110 training samples, a batch size of 8 and 4 accumulation steps (i.e. an effective batch size of 32), this method would only train on the first 96 samples (32 x 3), wasting the remaining 14 samples. To avoid this, I'd like to modify the code as follows (note the change to the final if statement; a quick check of which iterations then trigger an optimizer step is sketched after the code):

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
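For reference, here is a quick standalone check of which mini-batch indices would trigger an optimizer step under the modified condition, using the numbers above (110 samples, batch size 8, 4 accumulation steps); num_batches and step_at are names introduced only for this sketch:

num_batches = 14  # ceil(110 / 8) mini-batches per epoch
iters_to_accumulate = 4

# 1-based mini-batch indices at which scaler.step(optimizer) would run
step_at = [i + 1 for i in range(num_batches)
           if (i + 1) % iters_to_accumulate == 0 or (i + 1) == num_batches]
print(step_at)  # [4, 8, 12, 14] -- the final, smaller group of 2 mini-batches now also gets a step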

Is this correct and really that simple, or will this have any side effects? It seems very simple to me, but I've never seen it done before. Any help appreciated!

Question from: https://stackoverflow.com/questions/65842691/final-step-of-pytorch-gradient-accumulation-for-small-datasets


1 Reply


As Lucas Ramos already mentioned, when using a DataLoader whose underlying dataset size is not divisible by the batch size, the default behavior is to yield a smaller last batch:

drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
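For example, with the 110-sample setup from the question (a minimal sketch; the toy TensorDataset here just stands in for your real dataset):

import torch
from torch.utils.data import TensorDataset, DataLoader

# toy stand-in for a 110-sample labelled dataset
dataset = TensorDataset(torch.randn(110, 4), torch.randint(0, 2, (110,)))
loader = DataLoader(dataset, batch_size=8, drop_last=False)  # drop_last=False is the default

batch_sizes = [len(target) for _, target in loader]
print(len(batch_sizes), batch_sizes[-1])  # 14 batches in total, the last one has only 6 samples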

Your plan basically combines gradient accumulation with drop_last=False, that is, the last (effective) batch is smaller than the others.
Therefore, in principle there is nothing wrong with training with varying batch sizes.

However, there is one thing you need to fix in your code:
The loss is averaged over the mini-batch, so when you process mini-batches in the usual way you do not need to worry about it. When accumulating gradients, however, you do this averaging explicitly by dividing the loss by iters_to_accumulate:

loss = loss / iters_to_accumulate

For the last, smaller accumulation group you need to change this divisor to reflect the actual number of mini-batches being accumulated!
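For illustration, one way to do this while keeping your original single loop is to compute the divisor per iteration (a sketch only, reusing the names from your snippet; last_group_size is introduced here, and len(data) is assumed to be the number of mini-batches per epoch):

scaler = GradScaler()

num_batches = len(data)
# number of mini-batches in the final, possibly smaller accumulation group
last_group_size = num_batches % iters_to_accumulate or iters_to_accumulate

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        # divide by the number of mini-batches actually accumulated in this group
        in_last_group = i >= num_batches - last_group_size
        divisor = last_group_size if in_last_group else iters_to_accumulate

        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / divisor

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == num_batches:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()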

Alternatively, I propose this revised code, which breaks the training loop in two: an outer loop over the effective batches (accumulation groups), and an inner loop that accumulates gradients over the mini-batches within each group. Note how using iter over the DataLoader helps split the loop in two:

scaler = GradScaler()

for epoch in epochs:
    bi = 0  # index of the first mini-batch in the current accumulation group
    # outer loop over effective batches (accumulation groups)
    data_iter = iter(data)
    while bi < len(data):
        # determine the range of mini-batches for this effective batch
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the mini-batches of this group - accumulating gradients
        for i in range(bi, nbi):
            input, target = next(data_iter)
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
                loss = loss / (nbi - bi)  # divide by the actual number of mini-batches accumulated

            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done with the inner loop - gradients were accumulated, we can take an optimization step.

        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi
