Created December 18, 2019 06:12
$ srun -n 8 -c 10 -u -l python test_ddp.py --backend mpi | |
3: Initialized rank 3 local-rank 3 size 8 | |
1: Initialized rank 1 local-rank 1 size 8 | |
5: Initialized rank 5 local-rank 5 size 8 | |
7: Initialized rank 7 local-rank 7 size 8 | |
2: Initialized rank 2 local-rank 2 size 8 | |
4: Initialized rank 4 local-rank 4 size 8 | |
6: Initialized rank 6 local-rank 6 size 8 | |
0: Initialized rank 0 local-rank 0 size 8 | |
3: Generating a batch of data | |
1: Generating a batch of data | |
5: Generating a batch of data | |
7: Generating a batch of data | |
4: Generating a batch of data | |
6: Generating a batch of data | |
2: Generating a batch of data | |
0: Generating a batch of data | |
7: Constructing model | |
3: Constructing model | |
5: Constructing model | |
1: Constructing model | |
0: Constructing model | |
2: Constructing model | |
6: Constructing model | |
4: Constructing model | |
0: [1576565271.426514] [cgpu06:45376:0] cuda_ipc_md.c:62 UCX ERROR cuCtxGetDevice(&cu_device) is failed. ret:invalid device context | |
0: [1576565271.426547] [cgpu06:45376:0] ucp_rkey.c:250 UCX ERROR Failed to unpack remote key from remote md[4]: Input/output error | |
0: [cgpu06:45376:0:45523] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x20) | |
0: ==== backtrace ==== | |
0: 0 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x2293c) [0x2aab1e94293c] | |
0: 1 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x22ba4) [0x2aab1e942ba4] | |
0: 2 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_rkey_release+0xe) [0x2aab1e70751e] | |
0: 3 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_rkey_destroy+0x34) [0x2aab14e46804] | |
0: 4 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_ep_rkey_unpack+0x341) [0x2aab14e465e1] | |
0: 5 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_rndv_rtr_handler+0x1e8) [0x2aab14e63558] | |
0: 6 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0xed99) [0x2aab1e70ad99] | |
0: 7 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_iface_progress+0x6 | |
0: e) [0x2aab1e70af7e] | |
0: 8 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aab14e4b4fa] | |
0: 9 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x2aab14a206c7] | |
0: 10 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c) [0x2aaaff5f584c] | |
0: 11 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5) [0x2aaaff5fc255] | |
0: 12 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait_all+0x231) [0x2aaabb353331] | |
0: 13 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x46f) [0x2aaabb3ad97f] | |
0: 14 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8) [0x2aaabb3adce8] | |
0: 15 | |
0: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e) [0x2aab2796e7de] | |
0: 16 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116) [0x2aaabb36e776] | |
0: 17 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8) [0x2aaabae3bda8] | |
0: 18 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134) [0x2aaabae39864] | |
0: 19 /usr/lib64/libstdc++.so.6(+0xc338f) [0x2aaabb70538f] | |
0: 20 /lib64/libpthread.so.0(+0x7569) [0x2aaaaacda569] | |
0: 21 /lib64/libc.so.6(clone+0x3f) [0x2aaaaafe9a2f] | |
0: =================== | |
2: [1576565271.431356] [cgpu06:45378:0] mm_posix.c:449 UCX ERROR Error returned from open in attach. Permission denied. File name is: /proc/45376/fd/54 | |
2: [cgpu06:45378:0:45521] mm_ep.c:168 Fatal: Failed to attach to remote mmid:194888436023734. Shared memory error | |
2: ==== backtrace ==== | |
2: 0 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_message+0x99) [0x2aab1693fcc9] | |
2: 1 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6) [0x2aab1693fda6] | |
2: 2 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4) [0x2aab1670c1c4] | |
2: 3 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c) [0x2aab1670c63c] | |
2: 4 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d) [0x2aab14e76a9d] | |
2: 5 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a) [0x2aab14e78d0a] | |
2: 6 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125) [0x2aab14e79125] | |
2: 7 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a | |
2: ) [0x2aab14e7a59a] | |
2: 8 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a) [0x2aab1693939a] | |
2: 9 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aab14e4b4fa] | |
2: 10 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x2aab14a206c7] | |
2: 11 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c) [0x2aaaff5f584c] | |
2: 12 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5) [0x2aaaff5fc255] | |
2: 13 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2) [0x2aaabb352dc2] | |
2: 14 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3) [0x2aaabb3ad8f3] | |
2: 15 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0- | |
2: cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8) [0x2aaabb3adce8] | |
2: 16 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e) [0x2aab1f96d7de] | |
2: 17 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116) [0x2aaabb36e776] | |
2: 18 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8) [0x2aaabae3bda8] | |
2: 19 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134) [0x2aaabae39864] | |
2: 20 /usr/lib64/libstdc++.so.6(+0xc338f) [0x2aaabb70538f] | |
2: 21 /lib64/libpthread.so.0(+0x7569) [0x2aaaaacda569] | |
2: 22 /lib64/libc.so.6(clone+0x3f) [0x2aaaaafe9a2f] | |
2: =================== | |
2: [cgpu06:45378] *** Process received signal *** | |
2: [cgpu06:45378] Signal: Aborted (6) | |
2: [cgpu06:45378] Signal code: (-6) | |
2: [cgpu06:45378] [ 0] /lib64/libpthread.so.0(+0x12360)[0x2aaaaace5360] | |
2: [cgpu06:45378] [ 1] | |
2: /lib64/libc.so.6(gsignal+0x110)[0x2aaaaaf27160] | |
2: [cgpu06:45378] [ 2] | |
2: /lib64/libc.so.6(abort+0x151)[0x2aaaaaf28741] | |
2: [cgpu06:45378] [ 3] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x20cce)[0x2aab1693fcce] | |
2: [cgpu06:45378] [ 4] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6)[0x2aab1693fda6] | |
2: [cgpu06:45378] [ 5] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4)[0x2aab1670c1c4] | |
2: [cgpu06:45378] [ 6] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c)[0x2aab1670c63c] | |
2: [cgpu06:45378] [ 7] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d)[0x2aab14e76a9d] | |
2: [cgpu06:45378] [ 8] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a)[0x2aab14e78d0a] | |
2: [cgpu06:45378] [ 9] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125)[0x2aab14e79125] | |
2: [cgpu06:45378] [10] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a)[0x2aab14e7a59a] | |
2: [cgpu06:45378] [11] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a)[0x2aab1693939a] | |
2: [cgpu06:45378] [12] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a)[0x2aab14e4b4fa] | |
2: [cgpu06:45378] [13] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2aab14a206c7] | |
2: [cgpu06:45378] [14] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aaaff5f584c] | |
2: [cgpu06:45378] [15] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x2aaaff5fc255] | |
2: [cgpu06:45378] [16] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2)[0x2aaabb352dc2] | |
2: [cgpu06:45378] [17] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3)[0x2aaabb3ad8f3] | |
2: [cgpu06:45378] [18] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8)[0x2aaabb3adce8] | |
2: [cgpu06:45378] [19] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e)[0x2aab1f96d7de] | |
2: [cgpu06:45378] [20] | |
2: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116)[0x2aaabb36e776] | |
2: [cgpu06:45378] [21] | |
2: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8)[0x2aaabae3bda8] | |
2: [cgpu06:45378] [22] | |
2: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134)[0x2aaabae39864] | |
2: [cgpu06:45378] [23] | |
2: /usr/lib64/libstdc++.so.6(+0xc338f)[0x2aaabb70538f] | |
2: [cgpu06:45378] [24] /lib64/libpthread.so.0(+0x7569)[0x2aaaaacda569] | |
2: [cgpu06:45378] [25] | |
2: /lib64/libc.so.6(clone+0x3f)[0x2aaaaafe9a2f] | |
2: [cgpu06:45378] *** End of error message *** | |
4: [1576565271.438377] [cgpu06:45380:0] mm_posix.c:449 UCX ERROR Error returned from open in attach. Permission denied. File name is: /proc/45376/fd/54 | |
4: [cgpu06:45380:0:45520] mm_ep.c:168 Fatal: Failed to attach to remote mmid:194888436023734. Shared memory error | |
4: ==== backtrace ==== | |
4: 0 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_message+0x99) [0x2aab1693fcc9] | |
4: 1 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6) [0x2aab1693fda6] | |
4: 2 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4) [0x2aab1670c1c4] | |
4: 3 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c) [0x2aab1670c63c] | |
4: 4 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d) [0x2aab14e76a9d] | |
4: 5 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a) [0x2aab14e78d0a] | |
4: 6 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125) [0x2aab14e79125] | |
4: 7 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a) [0x2aab14e7a59a] | |
4: 8 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a) [0x2aab1693939a] | |
4: 9 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x2aab14e4b4fa] | |
4: 10 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x2aab14a206c7] | |
4: 11 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c) [ | |
4: 0x2aaaff5f584c] | |
4: 12 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5) [0x2aaaff5fc255] | |
4: 13 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2) [0x2aaabb352dc2] | |
4: 14 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3) [0x2aaabb3ad8f3] | |
4: 15 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8) [0x2aaabb3adce8] | |
4: 16 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e) [0x2aab1f96d7de] | |
4: 17 /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116) [0x2aaabb36e776] | |
4: 18 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8) [0x2aaabae3bda8] | |
4: 19 /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134) [0x2aaabae39864] | |
4: 20 /usr/lib64/libstdc++.so.6(+0xc338f) [0x2aaabb70538f] | |
4: 21 /lib64/libpthread.so.0(+0x7569) [0x2aaaaacda569] | |
4: 22 /lib64/libc.so.6(clone+0x3f) [0x2aaaaafe9a2f] | |
4: =================== | |
4: [cgpu06:45380] *** Process received signal *** | |
4: [cgpu06:45380] Signal: Aborted (6) | |
4: [cgpu06:45380] Signal code: (-6) | |
4: [cgpu06:45380] [ 0] | |
4: /lib64/libpthread.so.0(+0x12360)[0x2aaaaace5360] | |
4: [cgpu06:45380] [ 1] | |
4: /lib64/libc.so.6(gsignal+0x110)[0x2aaaaaf27160] | |
4: [cgpu06:45380] [ 2] /lib64/libc.so.6(abort+0x151)[0x2aaaaaf28741] | |
4: [cgpu06:45380] [ 3] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x20cce)[0x2aab1693fcce] | |
4: [cgpu06:45380] [ 4] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(ucs_fatal_error_format+0xd6)[0x2aab1693fda6] | |
4: [cgpu06:45380] [ 5] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(+0x111c4)[0x2aab1670c1c4] | |
4: [cgpu06:45380] [ 6] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x17c)[0x2aab1670c63c] | |
4: [cgpu06:45380] [ 7] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4aa9d)[0x2aab14e76a9d] | |
4: [cgpu06:45380] [ 8] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_wireup_msg_progress+0x7a)[0x2aab14e78d0a] | |
4: [cgpu06:45380] [ 9] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4d125)[0x2aab14e79125] | |
4: [cgpu06:45380] [10] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(+0x4e59a)[0x2aab14e7a59a] | |
4: [cgpu06:45380] [11] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucs.so.0(+0x1a39a)[0x2aab1693939a] | |
4: [cgpu06:45380] [12] /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a)[0x2aab14e4b4fa] | |
4: [cgpu06:45380] [13] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2aab14a206c7] | |
4: [cgpu06:45380] [14] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aaaff5f584c] | |
4: [cgpu06:45380] [15] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x2aaaff5fc255] | |
4: [cgpu06:45380] [16] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_request_default_wait+0x1e2)[0x2aaabb352dc2] | |
4: [cgpu06:45380] [17] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3e3)[0x2aaabb3ad8f3] | |
4: [cgpu06:45380] [18] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xc8)[0x2aaabb3adce8] | |
4: [cgpu06:45380] [19] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x13e)[0x2aab1f96d7de] | |
4: [cgpu06:45380] [20] | |
4: /usr/common/software/openmpi/4.0.1-ucx-1.6/gnu-7.3.0-cuda-10.1.168/lib/libmpi.so.40(MPI_Bcast+0x116)[0x2aaabb36e776] | |
4: [cgpu06:45380] [21] | |
4: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x834da8)[0x2aaabae3bda8] | |
4: [cgpu06:45380] [22] | |
4: /global/cscratch1/sd/sfarrell/conda/pytorch/v1.3.1-gpu/lib/python3.6/site-packages/torch/lib/libtorch_python.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0x134)[0x2aaabae39864] | |
4: [cgpu06:45380] [23] | |
4: /usr/lib64/libstdc++.so.6(+0xc338f)[0x2aaabb70538f] | |
4: [cgpu06:45380] [24] | |
4: /lib64/libpthread.so.0(+0x7569)[0x2aaaaacda569] | |
4: [cgpu06:45380] [25] | |
4: /lib64/libc.so.6(clone+0x3f)[0x2aaaaafe9a2f] | |
4: [cgpu06:45380] *** End of error message *** | |
srun: error: cgpu06: task 0: Segmentation fault | |
srun: Terminating job step 358157.14 | |
srun: error: cgpu06: tasks 2,4: Aborted | |
0: slurmstepd: error: *** STEP 358157.14 ON cgpu06 CANCELLED AT 2019-12-16T22:47:52 *** | |
srun: error: cgpu06: tasks 3,7: Terminated | |
srun: error: cgpu06: tasks 5-6: Terminated | |
srun: error: cgpu06: task 1: Terminated | |
srun: Force Terminated job step 358157.14 |