
tensorflow - GPU freezes randomly while training using tf.keras models

Versions being used: tensorflow-gpu: 2.0, CUDA v10, CuDNN v7.6.5, Python 3.7.4

System specs: i9-7920X, 4 x RTX 2080Ti, 128GB 2400MHz RAM, 2TB SATA SSD

Issue:

While training any model with tensorflow 2.0, the GPU freezes at a random point during an epoch: its power draw drops to around 70 W, core utilization sits at 0%, and memory utilization stays fixed at some arbitrary value.

I also do not get any error or exception when this happens.

The only way to recover is to restart the Jupyter kernel and run everything from the beginning.

I first thought that probably something was wrong with my code.

So I tried to replicate the issue by training a DenseNet on CIFAR-100, and the problem persisted.

If I run the training on multiple GPUs, the GPUs still freeze, but it happens very rarely.

But with a single GPU, it is guaranteed to get stuck at some point or another.

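(For reference: the post does not show how the multi-GPU runs were set up. One common way to do this with tf.keras in TF 2.0 is tf.distribute.MirroredStrategy; the following is a minimal sketch mirroring the single-GPU DenseNet code below, not necessarily the exact setup used here.)

import tensorflow as tf
from densenet import DenseNet

# Multi-GPU sketch using tf.distribute.MirroredStrategy (an assumption;
# the original post does not show its multi-GPU setup).
strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print('Number of devices:', strategy.num_replicas_in_sync)

with strategy.scope():
    # same DenseNet configuration as the single-GPU code below
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.01,
                                                    momentum=0.9, nesterov=True),
                  metrics=['accuracy'])

# model.fit(...) is then called exactly as in the single-GPU case.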

Below is the code used for training on CIFAR-100:

from densenet import DenseNet
from tensorflow.keras.datasets import cifar100
import tensorflow as tf
import numpy as np
from tqdm import tqdm_notebook as tqdm

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = cifar100.load_data(label_mode='fine')
num_classes = 100
y_test_original = y_test

# Convert class vectors to binary class matrices (one-hot encoding)
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Per-channel standardization using the training-set statistics
for i in range(3):
    mean = np.mean(X_train[:, :, :, i])
    std = np.std(X_train[:, :, :, i])
    X_train[:, :, :, i] = (X_train[:, :, :, i] - mean) / std
    X_test[:, :, :, i] = (X_test[:, :, :, i] - mean) / std

with tf.device('/gpu:0'):
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01,
                                    momentum=0.9,
                                    nesterov=True,
                                    name='SGD')
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Step-wise learning-rate schedule: 0.01 -> 0.001 -> 0.0001
def scheduler(epoch):
    if epoch < 151:
        return 0.01
    elif epoch < 251:
        return 0.001
    else:
        return 0.0001

callback = tf.keras.callbacks.LearningRateScheduler(scheduler)

# the scheduler callback is defined above, so pass it to fit()
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=300, verbose=2, callbacks=[callback])

PS: I even tried the code on my laptop, which has an i7-8750H, an RTX 2060, 32GB RAM, and a 970 EVO NVMe SSD.

Unfortunately I had the same problem of GPU freezing.

Does anyone know what the issue is?

Asked by Akash Nandi, translated from Stack Overflow.


1 Reply

It could be related to the package versions; try downgrading to tensorflow 1.14.

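(A quick way to confirm, inside the notebook, which TensorFlow build the kernel actually loaded after downgrading; this snippet is an addition, not part of the original answer.)

import tensorflow as tf

# Sanity check after the downgrade: print the loaded TF version and
# whether a GPU is visible. tf.test.is_gpu_available() exists in both
# TF 1.14 and TF 2.0.
print(tf.__version__)
print(tf.test.is_gpu_available())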

It could also be related to where the model is placed (GPU device placement).

Try building the model on the CPU instead, like this:

from tensorflow.keras.utils import multi_gpu_model  # available in TF 2.0 (removed in later releases)

with tf.device('/cpu:0'):
    model = DenseNet(....)

parallel_model = multi_gpu_model(model, gpus=4)

Then you can compile and fit parallel_model as usual.

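(A minimal sketch of that step; the compile/fit arguments below simply mirror the question's single-GPU code, and the batch size is an assumption, not part of the original answer.)

# Hypothetical usage sketch of the wrapped model.
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer=tf.keras.optimizers.SGD(learning_rate=0.01,
                                                         momentum=0.9, nesterov=True),
                       metrics=['accuracy'])
parallel_model.fit(X_train, y_train,
                   validation_data=(X_test, y_test),
                   epochs=300, batch_size=256,   # batch size is an assumption
                   verbose=2)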

