MOO Tensorflow sur GPU

Question

j'entraîne des données musicales sur un RNST LSTM dans Tensorflow et j'ai rencontré un problème avec l'allocation de mémoire GPU que je ne comprends pas: je rencontre un MOO alors qu'il semble qu'il reste assez de VRAM disponible. Quelques informations de base: Je travaille sur Ubuntu Gnome 16.04, avec une mémoire vive (GTX1060) de 6 Go, une puce Intel Xeon E3-1231V3 et une mémoire vive de 8 Go. Alors maintenant, tout d’abord la partie du message d’erreur que je peux comprendre, dans le et j’ajouterai tout le message d’erreur à la fin pour tous ceux qui pourraient demander de l’aide:

Je tensorflow/core/common_runtime/bfc_allocator.cc: 696] 8 morceaux de taille 256 totalisant 2,0 Ko I tensorflow/core/common_runtime/bfc_allocator.cc: 696] 1 Morceaux de taille 1280 totalisant 1,2 Ko I tensorflow/core/common_runtime/bfc_allocator.cc: 696] 5 morceaux de taille 44288 pour un total de 216,2 Ko I tensorflow/core/common_runtime/bfc_allocator.cc: 696] 5 morceaux de taille 56064 totalisant 273,8 Ko I tensorflow/core/common_runtime/bfc_allocator.cc: 696] 4 morceaux de taille 154350080 totalisant 588,80MiB tensorflow/core/common_runtime/bfc_allocator.cc: 696] 3 morceaux de taille 813400064 pour un total de 2,27 Gio I tensorflow/core/common_runtime/bfc_allocator.cc: 696] 1 Morceaux de taille 1612612352 totalisant 1,50 GiB I tensorflow/core/common_runtime/bfc_allocator.cc: 700] Somme totale de morceaux en cours d'utilisation: 4.35GiB I tensorflow/core/common_runtime/bfc_allocator.cc: 702] Stats:

Limite: 5484118016

InUse: 4670717952

MaxInUse: 5484118016

NumAllocs: 29

MaxAllocSize: 1612612352

W tensorflow/core/common_runtime/bfc_allocator.cc: 274] ********************* ___________ * __ ************************* ************************ xxxxxxxxxxxxxxW W tensorflow/core/common_runtime/bfc_allocator.cc: 275] est sorti de mémoire essayant d'allouer 775,72 Mo. Voir les journaux pour l'état de la mémoire. W tensorflow/core/framework/op_kernel.cc: 993] Ressource épuisée: MOO lors de l'affectation d'un tenseur de forme [14525,14000]

Donc, je peux lire qu’il ya un maximum de 5484118016 octets à attribuer, 4670717952 octets sont déjà utilisés et 777,72 Mo = 775720000 octets à attribuer. 5484118016 octets - 4670717952 octets - 775720000 octets = 37680064 octets selon ma calculatrice. Il devrait donc rester 37 Mo de VRAM libre après avoir alloué de la place au nouveau Tenseur qu'il souhaite y intégrer. Cela me semble également tout à fait légitime, car Tensorflow n’essaierait probablement pas (j’estime?) D’attribuer plus de VRAM que ce qui est encore disponible et de simplement mettre le reste des données en attente dans RAM ou quelque chose du genre.

Maintenant, je pense qu'il y a une grosse erreur dans ma pensée, mais je serais très reconnaissant si quelqu'un pouvait m'expliquer quelle est cette erreur. La stratégie de résolution évidente de mon problème est simplement de réduire un peu mes lots, les avoir chacun à environ 1,5 Go est probablement trop gros. Néanmoins, j'aimerais savoir quel est le problème actuel.

edit: j'ai trouvé quelque chose qui me dit d'essayer:

config = tf.ConfigProto() config.gpu_options.allocator_type = 'BFC' with tf.Session(config = config) as s:

qui ne fonctionne toujours pas, mais comme la documentation de tensorflow n’explique aucune explication

 gpu_options.allocator_type = 'BFC'

serait, j'aimerais vous demander les gars.

Ajouter le reste du message d'erreur pour toute personne intéressée:

Désolé pour le long copier/coller, mais peut-être que quelqu'un aurait besoin/voudrait le voir,

Merci beaucoup d'avance, Leon

(gputensorflow) leon@ljksUbuntu:~/Tensorflow$ python Netzwerk_v0.5.1_gamma.py I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:01:00.0 Total memory: 5.93GiB Free memory: 5.40GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0) I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 1, Chunks in use: 0 147.20MiB allocated for chunks. 147.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 1, Chunks in use: 0 628.52MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 775.72MiB was 256.00MiB, Chunk State: I tensorflow/core/common_runtime/bfc_allocator.cc:666] Size: 628.52MiB | Requested Size: 0B | in_use: 0, prev: Size: 147.20MiB | Requested Size: 147.20MiB | in_use: 1, next: Size: 54.8KiB | Requested Size: 54.7KiB | in_use: 1 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000000 of size 1280 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000500 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000600 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e100 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e200 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208018f00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019000 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019100 of size 813400064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387d1100 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387dec00 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b11e00 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cb00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cc00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cd00 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102722d4d00 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b615a00 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620700 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620800 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620900 of size 813400064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102abdd8900 of size 813400064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc590900 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc59e400 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc5abf00 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102e58df100 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec12300 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec1d000 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec27d00 of size 1612612352 I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x1024ae4ff00 of size 659049984 I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x102722e2800 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: Limit: 5484118016 InUse: 4670717952 MaxInUse: 5484118016 NumAllocs: 29 MaxAllocSize: 1612612352 W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state. W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000] Traceback (most recent call last): File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call return fn(*args) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn status, run_metadata) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/contextlib.py", line 66, in __exit__ next(self.gen) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] During handling of the above exception, another exception occurred: Traceback (most recent call last): File "Netzwerk_v0.5.1_gamma.py", line 171, in <module> session.run(tf.global_variables_initializer()) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run run_metadata_ptr) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run feed_dict_string, options, run_metadata) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run target_list, options, run_metadata) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] Caused by op 'rnn/basic_lstm_cell/weights/Initializer/random_uniform', defined at: File "Netzwerk_v0.5.1_gamma.py", line 94, in <module> initial_state=initial_state, time_major=False) # time_major = FALSE currently File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 545, in dynamic_rnn dtype=dtype) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 712, in _dynamic_rnn_loop swap_memory=swap_memory) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2626, in while_loop result = context.BuildLoop(cond, body, loop_vars, shape_invariants) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2459, in BuildLoop pred, body, original_loop_vars, loop_vars, shape_invariants) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2409, in _BuildLoop body_result = body(*packed_vars_for_body) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 697, in _time_step (output, new_state) = call_cell() File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 683, in <lambda> call_cell = lambda: cell(input_t, state) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 179, in __call__ concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 747, in _linear "weights", [total_arg_size, output_size], dtype=dtype) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable custom_getter=custom_getter) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable custom_getter=custom_getter) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable validate_shape=validate_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter caching_device=caching_device, validate_shape=validate_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable validate_shape=validate_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__ expected_shape=expected_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args initial_value(), name="initial_value", dtype=dtype) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in <lambda> shape.as_list(), dtype=dtype, partition_info=partition_info) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py", line 360, in __call__ dtype, seed=self.seed) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 246, in random_uniform return math_ops.add(rnd * (maxval - minval), minval, name=name) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add result = _op_def_lib.apply_op("Add", x=x, y=y, name=name) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op op_def=op_def) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op original_op=self._default_original_op, op_def=op_def) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__ self._traceback = _extract_stack() ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[14525,14000] [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]

jeck yung · Accepted Answer

Essayez de regarder ça

Veillez à ne pas exécuter le binaire d'évaluation et de formation sur le même GPU ou vous risquez de manquer de mémoire. Envisagez de lancer le évaluation sur un GPU séparé si disponible ou suspension de la formation binaire lors de l'exécution de l'évaluation sur le même GPU.

https://www.tensorflow.org/tutorials/deep_cnn

susan097 · Answer

Je résous ce problème en réduisant batch_size=52 Seulement pour réduire l'utilisation de la mémoire, c'est pour réduire batch_size.

La taille de lot dépend de votre carte graphique gpu, de la taille de la mémoire VRAM, de la mémoire cache, etc.

Veuillez préférer ceci Un autre lien de débordement de pile

Sriram Veturi · Answer

Je suis tombé sur le même problème. J'ai fermé toutes les fenêtres de l'invite anaconda et effacé toutes les tâches Python. Rouvrez une fenêtre d'invite Anaconda et exécutez le fichier train.py. Cela a fonctionné pour moi la prochaine fois. Les terminaux Anaconda et Python occupaient une mémoire qui ne laissait pas de place pour le processus d’entraînement.

Essayez également de réduire la taille du lot du processus de formation si l’approche ci-dessus ne fonctionne pas.

J'espère que cela t'aides ????

Essayez également de réduire la taille du lot du processus de formation si l’approche ci-dessus ne fonctionne pas.

J'espère que cela t'aides ????

Maruf · Answer

Lorsque vous rencontrez le MOO sur un GPU, je pense que changer batch size est la bonne option à essayer au début.

Pour des GPU différents, vous aurez peut-être besoin d'une taille de lot différente en fonction du GPU la mémoire que vous avez.

Récemment, j'ai rencontré le même type de problème, beaucoup modifié pour faire le type d'expérience différent.

Voici le lien vers la question (quelques astuces sont également incluses).

Cependant, tout en réduisant la taille du lot, vous constaterez peut-être que votre entraînement devient plus lent. Donc, si vous avez plusieurs GPU, vous pouvez les utiliser. Pour vérifier sur votre GPU, vous pouvez écrire sur le terminal,

nvidia-smi

Il vous montrera les informations nécessaires sur votre rack gpu.