Created March 8, 2019 18:47
2019-03-08 18:46:29.848043: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcublas.so.10.0 locally | |
2019-03-08 18:46:29.854419: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcublas.so.10.0 locally | |
2019-03-08 18:46:29.988655: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcublas.so.10.0 locally | |
2019-03-08 18:46:30.011434: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcublas.so.10.0 locally | |
2019-03-08 18:46:30.159645: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:30.162392: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcublas.so.10.0 locally | |
2019-03-08 18:46:30.256675: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:30.288976: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:30.404790: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:30.416447: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:30.586828: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:30.625139: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcublas.so.10.0 locally | |
2019-03-08 18:46:31.067166: I tensorflow/stream_executor/platform/default/dso_loader.cc:161] successfully opened CUDA library libcudnn.so.7 locally | |
2019-03-08 18:46:32.602704: E tensorflow/stream_executor/cuda/cuda_blas.cc:694] failed to run cuBLAS routine cublasSgemmEx: CUBLAS_STATUS_EXECUTION_FAILED | |
2019-03-08 18:46:32.603513: I tensorflow/stream_executor/stream.cc:4825] [stream=0x25c66050,impl=0x25c660f0] did not memzero GPU location; source: 0x7ff0107fac00 | |
2019-03-08 18:46:32.603558: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7ff0107fabf0 | |
2019-03-08 18:46:32.603567: I tensorflow/stream_executor/stream.cc:1826] [stream=0x25c66050,impl=0x25c660f0] did not enqueue 'start timer': 0x7ff0107fabf0 | |
2019-03-08 18:46:32.603584: I tensorflow/stream_executor/stream.cc:1838] [stream=0x25c66050,impl=0x25c660f0] did not enqueue 'stop timer': 0x7ff0107fabf0 | |
2019-03-08 18:46:32.603591: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr | |
[f1f0059c329e:24414] *** Process received signal *** | |
[f1f0059c329e:24414] Signal: Aborted (6) | |
[f1f0059c329e:24414] Signal code: (-6) | |
[f1f0059c329e:24414] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7ff14b09a390] | |
[f1f0059c329e:24414] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7ff14a5e4428] | |
[f1f0059c329e:24414] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7ff14a5e602a] | |
[f1f0059c329e:24414] [ 3] 2019-03-08 18:46:32.604949: I tensorflow/stream_executor/stream.cc:4787] [stream=0x25c66050,impl=0x25c660f0] did not memcpy device-to-host; source: 0x7feaa8149d00 | |
2019-03-08 18:46:32.605023: I tensorflow/stream_executor/stream.cc:4787] [stream=0x25c66050,impl=0x25c660f0] did not memcpy device-to-host; source: 0x7feaa8149d00 | |
/usr/local/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x6c5ce04)[0x7ff095afee04] | |
[f1f0059c329e:24414] [ 4] 2019-03-08 18:46:32.605319: I tensorflow/stream_executor/stream.cc:4825] [stream=0x25c66050,impl=0x25c660f0] did not memzero GPU location; source: 0x7ff010ffbc00 | |
2019-03-08 18:46:32.605336: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7ff010ffbbf0 | |
2019-03-08 18:46:32.605343: I tensorflow/stream_executor/stream.cc:1826] [stream=0x25c66050,impl=0x25c660f0] did not enqueue 'start timer': 0x7ff010ffbbf0 | |
2019-03-08 18:46:32.605360: I tensorflow/stream_executor/stream.cc:1838] [stream=0x25c66050,impl=0x25c660f0] did not enqueue 'stop timer': 0x7ff010ffbbf0 | |
2019-03-08 18:46:32.605367: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr | |
2019-03-08 18:46:32.657870: E tensorflow/stream_executor/cuda/cuda_blas.cc:694] failed to run cuBLAS routine cublasSgemmEx: CUBLAS_STATUS_EXECUTION_FAILED | |
2019-03-08 18:46:32.658628: I tensorflow/stream_executor/stream.cc:4787] [stream=0x273089f0,impl=0x27308a90] did not memcpy device-to-host; source: 0x7f9492b15f00 | |
2019-03-08 18:46:32.658759: I tensorflow/stream_executor/stream.cc:4787] [stream=0x273089f0,impl=0x27308a90] did not memcpy device-to-host; source: 0x7f9492b15f00 | |
2019-03-08 18:46:32.658961: I tensorflow/stream_executor/stream.cc:4787] [stream=0x273089f0,impl=0x27308a90] did not memcpy device-to-host; source: 0x7f9492b15e00 | |
2019-03-08 18:46:32.659006: I tensorflow/stream_executor/stream.cc:4787] [stream=0x273089f0,impl=0x27308a90] did not memcpy device-to-host; source: 0x7f9492b15e00 | |
2019-03-08 18:46:32.659094: I tensorflow/stream_executor/stream.cc:4787] [stream=0x273089f0,impl=0x27308a90] did not memcpy device-to-host; source: 0x7f9492a95d00 | |
2019-03-08 18:46:32.659135: I tensorflow/stream_executor/stream.cc:4787] [stream=0x273089f0,impl=0x27308a90] did not memcpy device-to-host; source: 0x7f9492a95d00 | |
2019-03-08 18:46:32.738744: E tensorflow/stream_executor/cuda/cuda_blas.cc:694] failed to run cuBLAS routine cublasSgemmEx: CUBLAS_STATUS_EXECUTION_FAILED | |
2019-03-08 18:46:32.754039: E tensorflow/stream_executor/cuda/cuda_blas.cc:694] failed to run cuBLAS routine cublasSgemmEx: CUBLAS_STATUS_EXECUTION_FAILED | |
2019-03-08 18:46:32.755964: I tensorflow/stream_executor/stream.cc:4787] [stream=0x277fa330,impl=0x277fce50] did not memcpy device-to-host; source: 0x7f3dcafdbf00 | |
2019-03-08 18:46:32.756091: I tensorflow/stream_executor/stream.cc:4787] [stream=0x277fa330,impl=0x277fce50] did not memcpy device-to-host; source: 0x7f3dcaebbd00 | |
2019-03-08 18:46:32.756130: I tensorflow/stream_executor/stream.cc:4787] [stream=0x277fa330,impl=0x277fce50] did not memcpy device-to-host; source: 0x7f3dcaebbd00 | |
2019-03-08 18:46:32.812762: E tensorflow/stream_executor/cuda/cuda_blas.cc:694] failed to run cuBLAS routine cublasSgemmEx: CUBLAS_STATUS_EXECUTION_FAILED | |
2019-03-08 18:46:32.812868: I tensorflow/stream_executor/stream.cc:1852] [stream=0x27263c00,impl=0x27263ca0] did not wait for [stream=0x3f4718c0,impl=0x272635c0] | |
2019-03-08 18:46:32.812896: I tensorflow/stream_executor/stream.cc:4800] [stream=0x27263c00,impl=0x27263ca0] did not memcpy host-to-device; source: 0x7f0bc207b200 | |
2019-03-08 18:46:32.812940: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed | |
[f1f0059c329e:24413] *** Process received signal *** | |
[f1f0059c329e:24413] Signal: Aborted (6) | |
[f1f0059c329e:24413] Signal code: (-6) | |
[f1f0059c329e:24413] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f0de86be390] | |
[f1f0059c329e:24413] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f0de7c08428] | |
[f1f0059c329e:24413] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f0de7c0a02a] | |
[f1f0059c329e:24413] [ 3] 2019-03-08 18:46:32.813422: I tensorflow/stream_executor/stream.cc:4787] [stream=0x3f4718c0,impl=0x272635c0] did not memcpy device-to-host; source: 0x7f0b8893b700 | |
2019-03-08 18:46:32.814065: I tensorflow/stream_executor/stream.cc:4825] [stream=0x3f4718c0,impl=0x272635c0] did not memzero GPU location; source: 0x7f0cad7fcc00 | |
2019-03-08 18:46:32.814085: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7f0cad7fcbf0 | |
2019-03-08 18:46:32.814094: I tensorflow/stream_executor/stream.cc:1826] [stream=0x3f4718c0,impl=0x272635c0] did not enqueue 'start timer': 0x7f0cad7fcbf0 | |
2019-03-08 18:46:32.814109: I tensorflow/stream_executor/stream.cc:1838] [stream=0x3f4718c0,impl=0x272635c0] did not enqueue 'stop timer': 0x7f0cad7fcbf0 | |
2019-03-08 18:46:32.814116: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr | |
2019-03-08 18:46:32.937764: E tensorflow/stream_executor/cuda/cuda_blas.cc:694] failed to run cuBLAS routine cublasSgemmEx: CUBLAS_STATUS_EXECUTION_FAILED | |
2019-03-08 18:46:32.938853: I tensorflow/stream_executor/stream.cc:4787] [stream=0x3b514fb0,impl=0x3b515050] did not memcpy device-to-host; source: 0x7f968f2c7600 | |
2019-03-08 18:46:32.938926: I tensorflow/stream_executor/stream.cc:4787] [stream=0x3b514fb0,impl=0x3b515050] did not memcpy device-to-host; source: 0x7f968f2c7600 | |
[0308 18:46:33 @input_source.py:176] EnqueueThread QueueInput/input_queue Exited. | |
------------------------------------------------------- | |
Primary job terminated normally, but 1 process returned | |
a non-zero exit code. Per user-direction, the job has been aborted. | |
------------------------------------------------------- | |
Traceback (most recent call last): | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call | |
return fn(*args) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1320, in _run_fn | |
options, feed_dict, fetch_list, target_list, run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1408, in _call_tf_sessionrun | |
run_metadata) | |
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20643840, 1), b.shape=(1, 4), m=20643840, n=4, k=1 | |
[[{{node fpn/fpn/upsample_lat3/Tensordot/MatMul}}]] | |
[[gradients/rpn_losses_batch/level3/boolean_mask_31/GatherV2_grad/Shape/_8415]] | |
During handling of the above exception, another exception occurred: | |
Traceback (most recent call last): | |
File "/tensorpack-mask-rcnn/MaskRCNN/train.py", line 651, in <module> | |
launch_train_with_config(traincfg, trainer) | |
File "/tensorpack-mask-rcnn/tensorpack/train/interface.py", line 94, in launch_train_with_config | |
extra_callbacks=config.extra_callbacks) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 343, in train_with_defaults | |
steps_per_epoch, starting_epoch, max_epoch) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 315, in train | |
self.main_loop(steps_per_epoch, starting_epoch, max_epoch) | |
File "/tensorpack-mask-rcnn/tensorpack/utils/argtools.py", line 176, in wrapper | |
return func(*args, **kwargs) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 280, in main_loop | |
self.run_step() # implemented by subclass | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 180, in run_step | |
self.hooked_sess.run(self.train_op) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 694, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1189, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1288, in run | |
raise six.reraise(*original_exc_info) | |
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise | |
raise value | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1273, in run | |
return self._sess.run(*args, **kwargs) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1345, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1109, in run | |
return self._sess.run(*args, **kwargs) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 930, in run | |
run_metadata_ptr) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1153, in _run | |
feed_dict_tensor, options, run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _do_run | |
run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1349, in _do_call | |
raise type(e)(node_def, op, message) | |
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20643840, 1), b.shape=(1, 4), m=20643840, n=4, k=1 | |
[[node fpn/fpn/upsample_lat3/Tensordot/MatMul (defined at tensorpack-mask-rcnn/tensorpack/models/pool.py:130) ]] | |
[[gradients/rpn_losses_batch/level3/boolean_mask_31/GatherV2_grad/Shape/_8415]] | |
Original stack trace for 'fpn/fpn/upsample_lat3/Tensordot/MatMul': | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 651, in <module> | |
launch_train_with_config(traincfg, trainer) | |
File "tensorpack-mask-rcnn/tensorpack/train/interface.py", line 84, in launch_train_with_config | |
model._build_graph_get_cost, model.get_optimizer) | |
File "tensorpack-mask-rcnn/tensorpack/utils/argtools.py", line 176, in wrapper | |
return func(*args, **kwargs) | |
Traceback (most recent call last): | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call | |
return fn(*args) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1320, in _run_fn | |
options, feed_dict, fetch_list, target_list, run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1408, in _call_tf_sessionrun | |
run_metadata) | |
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20643840, 1), b.shape=(1, 4), m=20643840, n=4, k=1 | |
[[{{node fpn/fpn/upsample_lat3/Tensordot/MatMul}}]] | |
[[gradients/GatherV2_7_grad/Shape/_8953]] | |
During handling of the above exception, another exception occurred: | |
Traceback (most recent call last): | |
File "/tensorpack-mask-rcnn/MaskRCNN/train.py", line 651, in <module> | |
launch_train_with_config(traincfg, trainer) | |
File "/tensorpack-mask-rcnn/tensorpack/train/interface.py", line 94, in launch_train_with_config | |
extra_callbacks=config.extra_callbacks) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 343, in train_with_defaults | |
steps_per_epoch, starting_epoch, max_epoch) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 315, in train | |
self.main_loop(steps_per_epoch, starting_epoch, max_epoch) | |
File "/tensorpack-mask-rcnn/tensorpack/utils/argtools.py", line 176, in wrapper | |
return func(*args, **kwargs) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 280, in main_loop | |
self.run_step() # implemented by subclass | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 180, in run_step | |
self.hooked_sess.run(self.train_op) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 694, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1189, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1288, in run | |
raise six.reraise(*original_exc_info) | |
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise | |
raise value | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1273, in run | |
return self._sess.run(*args, **kwargs) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1345, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1109, in run | |
return self._sess.run(*args, **kwargs) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 930, in run | |
run_metadata_ptr) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1153, in _run | |
feed_dict_tensor, options, run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _do_run | |
run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1349, in _do_call | |
raise type(e)(node_def, op, message) | |
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20643840, 1), b.shape=(1, 4), m=20643840, n=4, k=1 | |
[[node fpn/fpn/upsample_lat3/Tensordot/MatMul (defined at tensorpack-mask-rcnn/tensorpack/models/pool.py:130) ]] | |
[[gradients/GatherV2_7_grad/Shape/_8953]] | |
Original stack trace for 'fpn/fpn/upsample_lat3/Tensordot/MatMul': | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 651, in <module> | |
launch_train_with_config(traincfg, trainer) | |
File "tensorpack-mask-rcnn/tensorpack/train/interface.py", line 84, in launch_train_with_config | |
model._build_graph_get_cost, model.get_optimizer) | |
File "tensorpack-mask-rcnn/tensorpack/utils/argtools.py", line 176, in wrapper | |
return func(*args, **kwargs) | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 216, in setup_gTraceback (most recent call last): | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call | |
return fn(*args) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1320, in _run_fn | |
options, feed_dict, fetch_list, target_list, run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1408, in _call_tf_sessionrun | |
run_metadata) | |
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20643840, 1), b.shape=(1, 4), m=20643840, n=4, k=1 | |
[[{{node fpn/fpn/upsample_lat3/Tensordot/MatMul}}]] | |
[[gradients/rpn_losses_batch/level3/boolean_mask_31/GatherV2_grad/Shape/_8415]] | |
During handling of the above exception, another exception occurred: | |
Traceback (most recent call last): | |
File "/tensorpack-mask-rcnn/MaskRCNN/train.py", line 651, in <module> | |
launch_train_with_config(traincfg, trainer) | |
File "/tensorpack-mask-rcnn/tensorpack/train/interface.py", line 94, in launch_train_with_config | |
extra_callbacks=config.extra_callbacks) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 343, in train_with_defaults | |
steps_per_epoch, starting_epoch, max_epoch) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 315, in train | |
self.main_loop(steps_per_epoch, starting_epoch, max_epoch) | |
File "/tensorpack-mask-rcnn/tensorpack/utils/argtools.py", line 176, in wrapper | |
return func(*args, **kwargs) | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 280, in main_loop | |
self.run_step() # implemented by subclass | |
File "/tensorpack-mask-rcnn/tensorpack/train/base.py", line 180, in run_step | |
self.hooked_sess.run(self.train_op) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 694, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1189, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1288, in run | |
raise six.reraise(*original_exc_info) | |
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise | |
raise value | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1273, in run | |
return self._sess.run(*args, **kwargs) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1345, in run | |
run_metadata=run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1109, in run | |
return self._sess.run(*args, **kwargs) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 930, in run | |
run_metadata_ptr) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1153, in _run | |
feed_dict_tensor, options, run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _do_run | |
run_metadata) | |
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1349, in _do_call | |
raise type(e)(node_def, op, message) | |
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20643840, 1), b.shape=(1, 4), m=20643840, n=4, k=1 | |
[[node fpn/fpn/upsample_lat3/Tensordot/MatMul (defined at tensorpack-mask-rcnn/tensorpack/models/pool.py:130) ]] | |
[[gradients/rpn_losses_batch/level3/boolean_mask_31/GatherV2_grad/Shape/_8415]] | |
Original stack trace for 'fpn/fpn/upsample_lat3/Tensordot/MatMul': | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 651, in <module> | |
launch_train_with_config(traincfg, trainer) | |
File "tensorpack-mask-rcnn/tensorpack/train/interface.py", line 84, in launch_train_with_config | |
model._build_graph_get_cost, model.get_optimizer) | |
File "tensorpack-mask-rcnn/tensorpack/utils/argtools.py", line 176, in wrapper | |
return func(*args, **kwargs) | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 216, in setup_graph | |
train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn) | |
File "tensorpack-mask-rcnn/tensorpack/train/trainers.py", line 410, in _setup_graph | |
grads = self._make_get_grad_fn(input, get_cost_fn, get_opt_fn)() | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 283, in get_grad_fn | |
return compute_grad_from_inputs(*inputs) | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 247, in compute_grad_from_inputs | |
cost = get_cost_fn(*inputs) | |
File "tensorpack-mask-rcnn/tensorpack/tfutils/tower.py", line 286, in __call__ | |
output = self._tower_fn(*args) | |
File "tensorpack-mask-rcnn/tensorpack/graph_builder/model_desc.py", line 262, in _build_graph_get_cost | |
ret = self.build_graph(*inputs) | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 124, in build_graph | |
features = self.backbone(images) | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 193, in backbone | |
p23456 = fpn_model('fpn', c2345, fp16=self.fp16) | |
File "tensorpack-mask-rcnn/tensorpack/models/registry.py", line 128, in wrapped_func | |
outputs = func(*args, **actual_args) | |
File "tensorpack-mask-rcnn/MaskRCNN/model_fpn.py", line 80, in fpn_model | |
lat = lat + upsample2x('upsample_lat{}'.format(6 - idx), lat_sum_5432[-1]) | |
File "tensorpack-mask-rcnn/MaskRCNN/model_fpn.py", line 57, in upsample2x | |
data_format='channels_first') | |
File "tensorpack-mask-rcnn/tensorpack/models/registry.py", line 128, in wrapped_func | |
outputs = func(*args, **actual_args) | |
File "tensorpack-mask-rcnn/tensorpack/models/pool.py", line 130, in FixedUnPooling | |
ret = tf.tensordot(x, mat, axes=1) # bxcxhxwxshxsw | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 3641, in tensordot | |
ab_matmul = matmul(a_reshape, b_reshape) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2513, in matmul | |
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5675, in mat_mul | |
name=name) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 800, in _apply_op_helper | |
op_def=op_def) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func | |
return func(*args, **kwargs) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3473, in create_op | |
op_def=op_def) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1961, in __init__ | |
self._traceback = tf_stack.extract_stack() | |
[0308 18:46:33 @input_source.py:176] EnqueueThread QueueInput/input_queue Exited. | |
[0308 18:46:33 @input_source.py:176] EnqueueThread QueueInput/input_queue Exited. | |
raph | |
train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn) | |
File "tensorpack-mask-rcnn/tensorpack/train/trainers.py", line 410, in _setup_graph | |
grads = self._make_get_grad_fn(input, get_cost_fn, get_opt_fn)() | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 283, in get_grad_fn | |
return compute_grad_from_inputs(*inputs) | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 247, in compute_grad_from_inputs | |
cost = get_cost_fn(*inputs) | |
File "tensorpack-mask-rcnn/tensorpack/tfutils/tower.py", line 286, in __call__ | |
output = self._tower_fn(*args) | |
File "tensorpack-mask-rcnn/tensorpack/graph_builder/model_desc.py", line 262, in _build_graph_get_cost | |
ret = self.build_graph(*inputs) | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 124, in build_graph | |
features = self.backbone(images) | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 193, in backbone | |
p23456 = fpn_model('fpn', c2345, fp16=self.fp16) | |
File "tensorpack-mask-rcnn/tensorpack/models/registry.py", line 128, in wrapped_func | |
outputs = func(*args, **actual_args) | |
File "tensorpack-mask-rcnn/MaskRCNN/model_fpn.py", line 80, in fpn_model | |
lat = lat + upsample2x('upsample_lat{}'.format(6 - idx), lat_sum_5432[-1]) | |
File "tensorpack-mask-rcnn/MaskRCNN/model_fpn.py", line 57, in upsample2x | |
data_format='channels_first') | |
File "tensorpack-mask-rcnn/tensorpack/models/registry.py", line 128, in wrapped_func | |
outputs = func(*args, **actual_args) | |
File "tensorpack-mask-rcnn/tensorpack/models/pool.py", line 130, in FixedUnPooling | |
ret = tf.tensordot(x, mat, axes=1) # bxcxhxwxshxsw | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 3641, in tensordot | |
ab_matmul = matmul(a_reshape, b_reshape) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2513, in matmul | |
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5675, in mat_mul | |
name=name) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 800, in _apply_op_helper | |
op_def=op_def) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func | |
return func(*args, **kwargs) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3473, in create_op | |
op_def=op_def) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1961, in __init__ | |
self._traceback = tf_stack.extract_stack() | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 216, in setup_graph | |
train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn) | |
File "tensorpack-mask-rcnn/tensorpack/train/trainers.py", line 410, in _setup_graph | |
grads = self._make_get_grad_fn(input, get_cost_fn, get_opt_fn)() | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 283, in get_grad_fn | |
return compute_grad_from_inputs(*inputs) | |
File "tensorpack-mask-rcnn/tensorpack/train/tower.py", line 247, in compute_grad_from_inputs | |
cost = get_cost_fn(*inputs) | |
File "tensorpack-mask-rcnn/tensorpack/tfutils/tower.py", line 286, in __call__ | |
output = self._tower_fn(*args) | |
File "tensorpack-mask-rcnn/tensorpack/graph_builder/model_desc.py", line 262, in _build_graph_get_cost | |
ret = self.build_graph(*inputs) | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 124, in build_graph | |
features = self.backbone(images) | |
File "tensorpack-mask-rcnn/MaskRCNN/train.py", line 193, in backbone | |
p23456 = fpn_model('fpn', c2345, fp16=self.fp16) | |
File "tensorpack-mask-rcnn/tensorpack/models/registry.py", line 128, in wrapped_func | |
outputs = func(*args, **actual_args) | |
File "tensorpack-mask-rcnn/MaskRCNN/model_fpn.py", line 80, in fpn_model | |
lat = lat + upsample2x('upsample_lat{}'.format(6 - idx), lat_sum_5432[-1]) | |
File "tensorpack-mask-rcnn/MaskRCNN/model_fpn.py", line 57, in upsample2x | |
data_format='channels_first') | |
File "tensorpack-mask-rcnn/tensorpack/models/registry.py", line 128, in wrapped_func | |
outputs = func(*args, **actual_args) | |
File "tensorpack-mask-rcnn/tensorpack/models/pool.py", line 130, in FixedUnPooling | |
ret = tf.tensordot(x, mat, axes=1) # bxcxhxwxshxsw | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 3641, in tensordot | |
ab_matmul = matmul(a_reshape, b_reshape) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2513, in matmul | |
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5675, in mat_mul | |
name=name) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 800, in _apply_op_helper | |
op_def=op_def) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func | |
return func(*args, **kwargs) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3473, in create_op | |
op_def=op_def) | |
File "usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1961, in __init__ | |
self._traceback = tf_stack.extract_stack() | |
-------------------------------------------------------------------------- | |
mpirun.real noticed that process rank 7 with PID 0 on node f1f0059c329e exited on signal 6 (Aborted). | |
-------------------------------------------------------------------------- |