'tf.data()' throwing "Your input ran out of data; interrupting training"

Question

I'm seeing strange problems when trying to use tf.data to generate batched data with the Keras API. It keeps throwing errors saying that it is running out of training data.

TensorFlow 2.1

import numpy as np
import nibabel
import tensorflow as tf
from tensorflow.keras.layers import Conv3D, MaxPooling3D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras import Model
import os
import random

"""Configure GPUs to prevent OOM errors"""
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

"""Retrieve file names"""
ad_files = os.listdir("/home/asdf/OASIS/3D/ad/")
cn_files = os.listdir("/home/asdf/OASIS/3D/cn/")

sub_id_ad = []
sub_id_cn = []

"""OASIS AD: 178 Subjects, 278 3T MRIs"""
"""OASIS CN: 588 Subjects, 1640 3T MRIs"""
"""Down-sampling CN to 278 MRIs"""
random.Random(129).shuffle(ad_files)
random.Random(129).shuffle(cn_files)

"""Split files for training"""
ad_train = ad_files[0:276]
cn_train = cn_files[0:276]

"""Shuffle Train data and Train labels"""
train = ad_train + cn_train
labels = np.concatenate((np.ones(len(ad_train)), np.zeros(len(cn_train))), axis=None)
random.Random(129).shuffle(train)
random.Random(129).shuffle(labels)
print(len(train))
print(len(labels))

"""Change working directory to OASIS/3D/all/"""
os.chdir("/home/asdf/OASIS/3D/all/")

"""Create tf data pipeline"""
def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata())
    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    nifti = tf.convert_to_tensor(nifti, np.float64)
    return nifti, label

@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, labels):
    return tf.py_function(load_image, [file, labels], [tf.float64, tf.float64])

dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(6, 129)
dataset = dataset.repeat(50)
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(6)
dataset = dataset.prefetch(buffer_size=1)
iterator = iter(dataset)
batch_images, batch_labels = iterator.get_next()

########################################################################################
with tf.device("/cpu:0"):
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential()
        model.add(Conv3D(64,
                         input_shape=(100, 100, 100, 1),
                         data_format='channels_last',
                         kernel_size=(7, 7, 7),
                         strides=(2, 2, 2),
                         padding='valid',
                         activation='relu'))
    with tf.device("/gpu:1"):
        model.add(Conv3D(64,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))
    with tf.device("/gpu:2"):
        model.add(Conv3D(128,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))
        model.add(MaxPooling3D(pool_size=(2, 2, 2),
                               padding='valid'))
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))

model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adagrad(0.01),
              metrics=['accuracy'])
########################################################################################
model.fit(batch_images, batch_labels, steps_per_epoch=92, epochs=50)

After creating the dataset, I shuffle it and set the repeat count to num_of_epochs, i.e. 50 in this case. This works, but it crashes after the 3rd epoch, and I can't figure out what I'm doing wrong in this particular case. Am I supposed to declare the repeat and shuffle statements at the top of the pipeline?

Here is the error:

Epoch 3/50 92/6 [============================================================================================================================================================================================================================================================================================================================================================================================================================================================================] - 3s 36ms/sample - loss: 0.1902 - accuracy: 0.8043 Epoch 4/50 5/6 [========================>.....] - ETA: 0s - loss: 0.2216 - accuracy: 0.80002020-03-06 15:18:17.804126: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence [[{{node IteratorGetNext}}]] [[BiasAddGrad_3/_54]] 2020-03-06 15:18:17.804137: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence [[{{node IteratorGetNext}}]] [[sequential/conv3d_3/Conv3D/ReadVariableOp/_21]] 2020-03-06 15:18:17.804140: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence [[{{node IteratorGetNext}}]] [[Conv3DBackpropFilterV2_3/_68]] 2020-03-06 15:18:17.804263: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence [[{{node IteratorGetNext}}]] [[sequential/dense/MatMul/ReadVariableOp/_30]] 2020-03-06 15:18:17.804364: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence [[{{node IteratorGetNext}}]] [[BiasAddGrad_5/_62]] 2020-03-06 15:18:17.804561: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence [[{{node IteratorGetNext}}]] WARNING:tensorflow:Your input ran out of data; interrupting training. 
Make sure that your dataset or generator can generate at least `steps_per_Epoch * epochs` batches (in this case, 4600 batches). You may need to use the repeat() f24/6 [========================================================================================================================] - 1s 36ms/sample - loss: 0.1673 - accuracy: 0.8750 Traceback (most recent call last): File "python_scripts/gpu_farm/tf_data_generator/3D_tf_data_generator.py", line 181, in <module> evaluation_ad = model.evaluate(ad_test, ad_test_labels, verbose=0) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 930, in evaluate use_multiprocessing=use_multiprocessing) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 490, in evaluate use_multiprocessing=use_multiprocessing, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 426, in _model_iteration use_multiprocessing=use_multiprocessing) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 646, in _process_inputs x, y, sample_weight=sample_weights) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2383, in _standardize_user_data batch_size=batch_size) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2489, in _standardize_tensors y, self._feed_loss_fns, feed_output_shapes) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 810, in check_loss_and_target_compatibility ' while using as loss `' + loss_name + '`. ' ValueError: A target array with shape (5, 2) was passed for an output of shape (None, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output.
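
The figure of 4600 batches in the warning is simply steps_per_epoch * epochs (92 * 50). As a rough check, the sketch below counts how many batches a finite repeat(n) followed by batch(b) pipeline can yield, using the numbers from the question (552 files, batch size 6, repeat 50). The helper names are hypothetical and not part of TensorFlow. If the arithmetic holds, the pipeline itself has exactly enough batches, which hints that the shortage comes from how the data reaches model.fit() (a single 6-sample batch pulled from the iterator, hence the "92/6" progress bar) rather than from the repeat count.

```python
import math


def batches_available(num_samples, batch_size, repeat_count, drop_remainder=False):
    """Batches yielded by a finite pipeline shaped like repeat(n) -> batch(b).

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    total_elements = num_samples * repeat_count
    if drop_remainder:
        return total_elements // batch_size
    return math.ceil(total_elements / batch_size)


def batches_required(steps_per_epoch, epochs):
    """Batches Keras asks for when steps_per_epoch is set."""
    return steps_per_epoch * epochs


# Numbers taken from the question: 276 AD + 276 CN files, batch size 6.
available = batches_available(num_samples=552, batch_size=6, repeat_count=50)
required = batches_required(steps_per_epoch=92, epochs=50)

print(available, required)
print("enough data" if available >= required else "dataset will run dry")
```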

Update: So model.fit() has to be called as model.fit(x=data, y=labels) when using tf.data, because of a strange problem. This removes the list index out of range error, and now I'm back to my original error. However, it looks like this could be a TensorFlow issue: https://github.com/tensorflow/tensorflow/issues/32

When I increase the batch size from 6 to higher numbers and decrease steps_per_epoch, training gets through more epochs before throwing the StartAbort: Out of range errors.

Update 2: Following @jkjung13's suggestion, model.fit() takes a single data argument when used with a dataset, model.fit(x=batch). That is the correct implementation.

But you are supposed to pass the dataset itself, rather than an iterator created from it, if you only use the x parameter of model.fit().

So it should be: model.fit(dataset, epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))
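
As a minimal sketch of that calling convention, the example below passes a tf.data.Dataset straight to model.fit(). Everything about the data and model here is a stand-in assumption (small random arrays and a tiny dense network instead of the MRI volumes and 3D CNN). It also places shuffle before batch and an unbounded repeat() last, which is one common ordering, not necessarily the only valid one.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data (hypothetical): 48 samples with 8 features each.
num_samples, batch_size, epochs = 48, 6, 3
features = np.random.rand(num_samples, 8).astype(np.float32)
targets = np.random.randint(0, 2, size=(num_samples, 1)).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices((features, targets))
dataset = dataset.shuffle(buffer_size=num_samples, seed=129)  # buffer covers the whole set
dataset = dataset.batch(batch_size)
dataset = dataset.repeat()   # endless stream, so steps_per_epoch defines an epoch
dataset = dataset.prefetch(1)

steps_per_epoch = num_samples // batch_size  # 8 batches cover the data once

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# The dataset object itself is passed as x; no y and no batch_size.
history = model.fit(dataset, epochs=epochs, steps_per_epoch=steps_per_epoch, verbose=0)
print(len(history.history['loss']))
```

Because repeat() has no count here, the dataset cannot run dry, and steps_per_epoch alone decides where each epoch ends.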

And with that, I get a new error: GitHub Issue

Now, to get around this, I convert the dataset to a NumPy iterator: model.fit(dataset.as_numpy_iterator(), epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))

This solves the problem; however, the performance is appalling, similar to the old Keras model.fit_generator without multiprocessing. So this defeats the whole purpose of tf.data.

domin thomas · Accepted Answer

TF 2.1

It now works with the following settings:

def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata()).astype(np.float32)
    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    return nifti, label

@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, label):
    return tf.py_function(load_image, [file, label], [tf.float64, tf.float64])

dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.map(load_image_wrapper, num_parallel_calls=32)
dataset = dataset.prefetch(buffer_size=1)
dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/device:GPU:0', 1))

# So, my dataset size is 522, i.e. 522 MRI images.
# I need to load the entire dataset as a batch.
# This should exceed 60GiBs of RAM, but it doesn't go over 12GiB of RAM.
# I'm not sure how tf.data batch() stores the data, maybe a custom file?
# And also add a repeat parameter to iterate with each epoch.
dataset = dataset.batch(522, drop_remainder=True).repeat()

# Now initialise an iterator
iterator = iter(dataset)

# Create two objects, x & y, from batch
batch_image, batch_label = iterator.get_next()

##################################################################################
with tf.device("/cpu:0"):
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential()
        model.add(Conv3D(64,
                         input_shape=(100, 100, 100, 1),
                         data_format='channels_last',
                         kernel_size=(7, 7, 7),
                         strides=(2, 2, 2),
                         padding='valid',
                         activation='relu'))
    with tf.device("/gpu:1"):
        model.add(Conv3D(64,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))
    with tf.device("/gpu:2"):
        model.add(Conv3D(128,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))
        model.add(MaxPooling3D(pool_size=(2, 2, 2),
                               padding='valid'))
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.7))
        model.add(Dense(1, activation='sigmoid'))

model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adagrad(0.01),
              metrics=['accuracy'])
##################################################################################
# Now supply x=batch_image, y=batch_label to Keras' model.fit()
# And finally, supply your batch_size here!
model.fit(batch_image, batch_label, epochs=100, batch_size=12)
##################################################################################
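
On the memory question raised in the code comments: a back-of-the-envelope estimate suggests the single 522-volume batch is far smaller than 60 GiB once every volume has been cropped to 100 x 100 x 100 x 1. The sketch below only does the arithmetic; it assumes nothing else of significance is held in memory, and the 60 GiB expectation presumably referred to the uncropped volumes, which is a guess.

```python
# Size of one batch holding all 522 cropped volumes, per dtype.
num_volumes = 522
values_per_volume = 100 * 100 * 100 * 1  # shape after cropping and reshaping
bytes_per_value = {"float32": 4, "float64": 8}


def batch_gib(dtype):
    """Return the batch size in GiB for the given dtype name."""
    total_bytes = num_volumes * values_per_volume * bytes_per_value[dtype]
    return total_bytes / 2**30


for dtype in ("float32", "float64"):
    print(dtype, round(batch_gib(dtype), 2), "GiB")
```

Either way the batch comes to a few GiB, which is consistent with the observed usage staying under 12 GiB.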

With this, it takes about 8 minutes for training to start. But once training starts, I'm seeing incredible speeds!

Epoch 30/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.3526 - accuracy: 0.8640
Epoch 31/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.3334 - accuracy: 0.8448
Epoch 32/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.3308 - accuracy: 0.8697
Epoch 33/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2936 - accuracy: 0.8755
Epoch 34/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2935 - accuracy: 0.8851
Epoch 35/100
522/522 [==============================] - 14s 28ms/sample - loss: 0.3157 - accuracy: 0.8889
Epoch 36/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2910 - accuracy: 0.8851
Epoch 37/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2810 - accuracy: 0.8697
Epoch 38/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2536 - accuracy: 0.8966
Epoch 39/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2506 - accuracy: 0.9004
Epoch 40/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.2353 - accuracy: 0.8927
Epoch 41/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2336 - accuracy: 0.9042
Epoch 42/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2243 - accuracy: 0.9234
Epoch 43/100
522/522 [==============================] - 15s 29ms/sample - loss: 0.2181 - accuracy: 0.9176

15 seconds per epoch, compared with the old 12 minutes per epoch!

I will do further testing to see whether this really works and what impact it has on my test data. If there are any errors, I'll come back and update this post.

Why does this work? I have no idea. I couldn't find anything in the Keras documentation.
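
One plausible reading, not confirmed anywhere in this post: with batch(522), the first element pulled from the iterator is a pair of ordinary in-memory tensors containing the whole training set. model.fit() would then treat them like plain arrays and slice them by batch_size, so the tf.data pipeline runs only once, up front, which would account for both the 8-minute startup and the fast epochs. The sketch below illustrates the first half of that idea on small random stand-in data (all names and sizes here are hypothetical).

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data (hypothetical): 24 samples with 8 features each.
num_samples = 24
features = np.random.rand(num_samples, 8).astype(np.float32)
targets = np.random.randint(0, 2, size=(num_samples, 1)).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices((features, targets))
# One batch as large as the whole dataset, repeated like in the answer.
dataset = dataset.batch(num_samples, drop_remainder=True).repeat()

# Pulling a single element materialises the entire dataset as two tensors.
batch_x, batch_y = next(iter(dataset))
print(batch_x.shape, batch_y.shape)

# From here on Keras would see in-memory tensors, so batch_size=12 would
# mean this many gradient steps per epoch:
steps_with_batch_size_12 = num_samples // 12
print(steps_with_batch_size_12)
```

If this reading is right, the speed-up comes from holding the decoded volumes in memory rather than from tf.data itself, and the approach only scales as far as the whole dataset fits in RAM.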