Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - TensorFlow TPU Error: Shuffle Buffer Filled?

I have been trying to train a computer vision model on a TPU with TensorFlow, but I keep running into a problem when I commit the notebook in Kaggle's environment.

It is really weird: when I run the notebook manually, it works fine and finishes in under 20 minutes, but when I commit it, roughly 8 times out of 10 it gets stuck and shows the following message for 9 hours before the notebook dies.

Kaggle Notebook

Stuck with the following message:

2021-01-08 00:28:59.042056: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle Buffer Filled.
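
A note on reading that line: the single letter after the timestamp is TensorFlow's severity code (`I`, `W`, `E`, `F` for info, warning, error, fatal), so the message is logged at INFO level rather than as an error. Below is a minimal sketch that splits such a line into its fields, assuming the usual `timestamp: LEVEL file:line] message` layout. `parse_tf_log_line` is a hypothetical helper written for this post, not a TensorFlow API.

```python
import re

# Severity letters used in TensorFlow's C++ log prefix.
SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}

# Assumed layout: "<date> <time>: <LEVEL> <file>:<line>] <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) "
    r"(?P<source>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_tf_log_line(line):
    """Hypothetical helper: split one TensorFlow log line into named fields."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # line does not follow the assumed layout
    fields = match.groupdict()
    fields["level"] = SEVERITY[fields["level"]]
    return fields


line = (
    "2021-01-08 00:28:59.042056: I "
    "tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] "
    "Shuffle Buffer Filled."
)
parsed = parse_tf_log_line(line)
print(parsed["level"], "-", parsed["message"])  # INFO - Shuffle Buffer Filled.
```

If that reading is right, the line is probably just the last thing printed before the hang, not the cause of it.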

What I have tried

  1. Lowering the buffer_size
  2. Changing the order of operations in the load function
  3. Tuning the batch size

If anyone knows what is happening, please let me know!

Data Pipeline

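import tensorflow as tf
from kaggle_datasets import KaggleDatasets

# `strategy` is the TPU distribution strategy created earlier in the notebook.
AUTOTUNE = tf.data.experimental.AUTOTUNE
GCS_PATH = KaggleDatasets().get_gcs_path('cassava-leaf-disease-tfrecords-center-512x512')
BATCH_SIZE = 16 * strategy.num_replicas_in_sync  # global batch: 16 per replica
IMAGE_SIZE = [512, 512]
TARGET_SIZE = 512
CLASSES = ['0', '1', '2', '3', '4']
EPOCHS = 20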
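
To make the batch size concrete, here is a quick worked example. It assumes 8 replicas (a TPU v3-8); that number is an assumption for illustration, since the notebook reads it from `strategy.num_replicas_in_sync`. The 2048 is the shuffle buffer size used in `get_training_dataset()` below.

```python
PER_REPLICA_BATCH = 16   # from BATCH_SIZE = 16 * strategy.num_replicas_in_sync
ASSUMED_REPLICAS = 8     # assumption: value of num_replicas_in_sync on a TPU v3-8
SHUFFLE_BUFFER = 2048    # from dataset.shuffle(2048)


def global_batch_size(per_replica, num_replicas):
    """Examples consumed per training step across all replicas."""
    return per_replica * num_replicas


def batches_per_buffer(buffer_size, batch_size):
    """How many full batches' worth of examples one shuffle buffer holds."""
    return buffer_size // batch_size


batch = global_batch_size(PER_REPLICA_BATCH, ASSUMED_REPLICAS)
print(batch)                                      # 128
print(batches_per_buffer(SHUFFLE_BUFFER, batch))  # 16
```

So under that assumption each step consumes 128 images, and a full shuffle buffer covers 16 steps.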

def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)  # decode JPEG bytes to a uint8 tensor
    image = tf.cast(image, tf.float32) / 255.0  # cast to float32 and scale to [0, 1]
    image = tf.image.resize(image, IMAGE_SIZE)  # resize to 512x512 (added back to see if it does anything)
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # set an explicit static shape
    return image
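
Since `decode_image` returns a float32 tensor of shape [512, 512, 3], it may help to work out how much host memory those tensors take once they sit in the shuffle buffer. This is a rough estimate only. It assumes `transform` (not shown here) keeps the shape and dtype unchanged, and it ignores any per-element overhead.

```python
IMAGE_SIZE = [512, 512]   # from the pipeline config
CHANNELS = 3              # decode_jpeg(..., channels=3)
BYTES_PER_FLOAT32 = 4     # image is cast to tf.float32
SHUFFLE_BUFFER = 2048     # from dataset.shuffle(2048)


def decoded_image_bytes(height, width, channels, bytes_per_value):
    """Raw size of one decoded image tensor in bytes."""
    return height * width * channels * bytes_per_value


per_image = decoded_image_bytes(*IMAGE_SIZE, CHANNELS, BYTES_PER_FLOAT32)
buffer_bytes = per_image * SHUFFLE_BUFFER

print(per_image / 2**20)     # MiB per decoded image: 3.0
print(buffer_bytes / 2**30)  # GiB held by a full shuffle buffer: 6.0
```

Under those assumptions a full buffer of 2048 decoded images is about 6 GiB, because the shuffle runs after decoding rather than on the compressed records.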


def read_tfrecord(example, labeled=True):
    if labeled:
        TFREC_FORMAT = {
            'image': tf.io.FixedLenFeature([], tf.string), 
            'target': tf.io.FixedLenFeature([], tf.int64), 
        }
    else:
        TFREC_FORMAT = {
            'image': tf.io.FixedLenFeature([], tf.string), 
            'image_name': tf.io.FixedLenFeature([], tf.string), 
        }
    example = tf.io.parse_single_example(example, TFREC_FORMAT)
    image = decode_image(example['image'])
    if labeled:
        label_or_name = tf.cast(example['target'], tf.int32)
    else:
        label_or_name = example['image_name']
    return image, label_or_name

def load_dataset(filenames, labeled=True, ordered=False):
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False

    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
    dataset = dataset.with_options(ignore_order)
    dataset = dataset.map(lambda x: read_tfrecord(x, labeled=labeled), num_parallel_calls=AUTOTUNE)
    return dataset

def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES, labeled=True)
    dataset = dataset.map(transform, num_parallel_calls=AUTOTUNE)
    dataset = dataset.repeat()  # the training dataset must repeat for several epochs
    dataset = dataset.shuffle(2048)  # set higher than input? (shuffling after repeat() mixes examples across epoch boundaries)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTOTUNE) # prefetch next batch while training (autotune prefetch buffer size)
    return dataset
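
For anyone unsure what the shuffle buffer does, here is a rough pure-Python model of a fixed-size shuffle buffer. It is a sketch of the general technique, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name. As far as I understand it, "Shuffle Buffer Filled" corresponds to the point where the buffer first reaches capacity and elements can start flowing out.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Model of a fixed-size shuffle buffer (illustrative, not TensorFlow code)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling; nothing is emitted yet
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(sorted(shuffled) == list(range(10)))  # True: same elements, new order
print(shuffled[0] in (0, 1, 2))             # True: first output comes from the first 3 inputs
```

With `repeat()` placed before `shuffle()`, the input never ends, so in this model the drain step at the bottom is never reached.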

Fitting the model:

history = model.fit(
    x=get_training_dataset(),
    epochs=EPOCHS,
    steps_per_epoch=STEPS_PER_EPOCH,
    validation_steps=VALID_STEPS,
    validation_data=get_validation_dataset(),
    callbacks=[lr_callback, model_save, my_early_stopper],
    verbose=1,
)
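
Because the training dataset repeats forever, `steps_per_epoch` is what tells Keras where an epoch ends. `STEPS_PER_EPOCH` and `VALID_STEPS` are defined elsewhere in the notebook, so the sketch below only shows one common way such values are derived. The example counts and the batch size of 128 (16 per replica on an assumed 8 replicas) are made up for illustration, and `steps_for` is a hypothetical helper.

```python
import math


def steps_for(num_examples, batch_size, drop_remainder=True):
    """Number of batches needed to cover num_examples once."""
    if drop_remainder:
        return num_examples // batch_size       # ignore the last partial batch
    return math.ceil(num_examples / batch_size)  # count the last partial batch


NUM_TRAIN = 10_000   # hypothetical number of training images
NUM_VALID = 2_500    # hypothetical number of validation images
BATCH = 128          # assumption: 16 per replica * 8 replicas

print(steps_for(NUM_TRAIN, BATCH))                        # 78
print(steps_for(NUM_VALID, BATCH, drop_remainder=False))  # 20
```

Rounding down suits a repeating training set, while rounding up suits a finite validation set whose last partial batch should still be evaluated.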
