Versions being used: tensorflow-gpu: 2.0, CUDA v10, CuDNN v7.6.5, Python 3.7.4
System specs: i9-7920X, 4 x RTX 2080Ti, 128GB 2400MHz RAM, 2TB SATA SSD
Issue:
While training any model with TensorFlow 2.0, the GPU randomly freezes partway through an epoch: its power draw drops to around 70 W, core utilization sits at 0%, and memory utilization stays fixed at some arbitrary value.
No error or exception is raised when this happens.
The only way to recover is to restart the Jupyter kernel and run everything from the beginning.
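For reference, the utilization and power numbers above were read off nvidia-smi; a small helper like the one below (a sketch, not part of the original training code; the function name log_gpu_stats, the file path and the polling interval are made up for illustration) can log those readings to a file while training runs in another process, so the exact moment of the freeze shows up in the log:

# Hypothetical helper: poll nvidia-smi every few seconds and append the
# readings to a CSV, so a freeze (power ~70 W, utilization 0%) is
# visible in the log afterwards. Run it in a separate terminal/process.
import subprocess
import time

def log_gpu_stats(path="gpu_stats.csv", interval_s=5):
    query = (
        "nvidia-smi "
        "--query-gpu=timestamp,index,power.draw,utilization.gpu,memory.used "
        "--format=csv,noheader"
    )
    with open(path, "a") as f:
        while True:
            f.write(subprocess.check_output(query, shell=True, text=True))
            f.flush()
            time.sleep(interval_s)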
I first thought something was probably wrong with my own code, so I tried to reproduce the issue by training a DenseNet on CIFAR-100, and the problem persisted.
If I run the training on multiple GPUs, the GPUs still freeze, but it happens very rarely (a sketch of a multi-GPU run follows the code below).
With a single GPU, however, training is guaranteed to get stuck at some point or another.
Below is the code used for training on CIFAR-100:
from densenet import DenseNet
from tensorflow.keras.datasets import cifar100
import tensorflow as tf
import numpy as np
from tqdm import tqdm_notebook as tqdm

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = cifar100.load_data(label_mode='fine')
num_classes = 100
y_test_original = y_test

# Convert class vectors to binary class matrices. [one hot encoding]
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Per-channel standardization using the training-set mean and std
for i in range(3):
    mean = np.mean(X_train[:,:,:,i])
    std = np.std(X_train[:,:,:,i])
    X_train[:,:,:,i] = (X_train[:,:,:,i] - mean)/std
    X_test[:,:,:,i] = (X_test[:,:,:,i] - mean)/std
with tf.device('/gpu:0'):
    model = DenseNet(input_shape=(32,32,3), dense_blocks=3, dense_layers=-1, growth_rate=12,
                     nb_classes=100, dropout_rate=0.2, bottleneck=True, compression=0.5,
                     weight_decay=1e-4, depth=100)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01,
                                        momentum=0.9,
                                        nesterov=True,
                                        name='SGD')
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

    # Step-wise learning-rate schedule: 0.01 -> 0.001 -> 0.0001
    def scheduler(epoch):
        if epoch < 151:
            return 0.01
        elif epoch < 251:
            return 0.001
        elif epoch < 301:
            return 0.0001

    callback = tf.keras.callbacks.LearningRateScheduler(scheduler)

    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              epochs=300, verbose=2, callbacks=[callback])
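The exact setup of the multi-GPU runs mentioned earlier isn't shown in the question; a minimal sketch of how the same model could be trained across all four GPUs with tf.distribute.MirroredStrategy (reusing the DenseNet constructor, data and callback from above; the per-replica batch size of 64 is an assumption for illustration) would look roughly like this:

# Sketch only: multi-GPU variant of the training above using
# tf.distribute.MirroredStrategy (available in TF 2.0). Model construction
# and compilation must happen inside the strategy scope.
strategy = tf.distribute.MirroredStrategy()
print('Number of replica devices:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = DenseNet(input_shape=(32,32,3), dense_blocks=3, dense_layers=-1, growth_rate=12,
                     nb_classes=100, dropout_rate=0.2, bottleneck=True, compression=0.5,
                     weight_decay=1e-4, depth=100)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Scale the global batch size with the number of replicas (64 per GPU is
# an assumed value, not from the original question).
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=300, verbose=2,
          batch_size=64 * strategy.num_replicas_in_sync,
          callbacks=[callback])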
PS: I even tried the code on my laptop, which has an i7-8750H, an RTX 2060, 32GB of RAM and a 970 EVO NVMe SSD.
Unfortunately, I ran into the same GPU freezing problem there as well.
Does anyone know what the issue is?
Asked by Akash Nandi on Stack Overflow.