Hello. I encountered to the same problem while analyzing Nomad application via Callgrind tool.
Valgrind versions tested:
- valgrind-3.24.0 (Arch Linux, pacman)
commit
2d09ef48e
from the master branch.
Reproduction command and the resulting log:
# valgrind --tool=callgrind /tmp/nomad -h
...
Callgrind: threads.c:247 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.
host stacktrace:
==486603== at 0x5801F79F: show_sched_status_wrk.lto_priv.0 (m_libcassert.c:407)
==486603== by 0x5801F7FF: report_and_quit (m_libcassert.c:478)
==486603== by 0x5801F992: vgPlain_assert_fail (m_libcassert.c:544)
==486603== by 0x58023021: vgCallgrind_post_signal (threads.c:247)
==486603== by 0x5807A256: vgSysWrap_amd64_linux_sys_rt_sigreturn_before (sigframe-amd64-linux.c:599)
==486603== by 0x581DFD54: handle_syscall.isra.0 (syswrap-main.c:2358)
==486603== by 0x5806956E: vgPlain_scheduler (scheduler.c:1582)
==486603== by 0x580B30A9: run_a_thread_NORETURN (syswrap-linux.c:102)
I attempted to isolate the location of the bug and to determine the root cause. In my opinion, I achieved some noticeable results.
I have inserted additional debug prints (using VG_(printf)
) into the functions listed below. Therefore, I am able to capture more necessary context. See attached file debug-prints-339160-commit-2d09ef48e.patch .
- unwind_thread inside the loop
- CLG_(pop_call_stack)
- CLG_(pre_signal)
- CLG_(run_post_signal_on_call_stack_bottom)
- CLG_(post_signal)
- VG_(sigframe_destroy)
The callgrind output with additional debug print:
# ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h
==575331== Callgrind, a call-graph generating cache profiler
==575331== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==575331== Using Valgrind-3.25.0.GIT and LibVEX; rerun with -h for copyright info
==575331== Command: /tmp/nomad -h
==575331==
==575331== For interactive control, run 'callgrind_control -h'.
BB# 4793874
-- pre_signal(TID 1, sig 23, alt_st yes)
=== pop_call_stack current_tid=1 CLG_(current_state).sig=23
=== run_post_signal_on_call_stack_bottom
-- post_signal(TID 1, sig 23)
=== post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=23
=== post_signal before exec_state_restore es->sig=0
=== sigframe_destroy before VG_TRACK( post_deliver_signal, tid, sigNo )
BB# 4793961
-- post_signal(TID 1, sig 23)
=== post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=0
0x000000000003dcd0
Callgrind: threads.c:262 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.
Two points to consider:
- There is the second
post_signal
call before the assertion fail. During the second call variableCLG_(current_state)
appears reset, but thesigNum
(post_signal argument) still retains old value "23". post_signal
is called twice. The first trace ispop_call_stack -> run_post_signal_on_call_stack_bottom -> post_signal
. And the second:sigframe_destroy -> track_post_deliver_signal -> post_signal
.
I have not investigated this bug further, but based on the information above, I suppose that the signal is processed (called post_signal
) twice, which is incorrect behavior. The first processing may be valid, but the second processing causes Callgrind to crash with an assertion error.
I wrote a workaround in order to bypass the second post_signal
(wrong one), that is being called from the sigframe_destroy
. I added a new function, CLG_(post_signal_delivered)
, which wraps CLG_(post_signal)
and checks is CLG_(current_state).sig
equal to zero.
I have prepared a patch-file (tested only on commit 2d09ef48e
).
See file bug-339160-workaround-commit-2d09ef48e .
The results:
# ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h
...
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Usage: nomad [-version] [-help] [-autocomplete-(un)install] <command> [args]
Common commands:
run Run a new job or update an existing job
......
agent Runs a Nomad agent
Other commands:
acl Interact with ACL policies and tokens
action Run a pre-defined command from a given context
......
version Prints the Nomad version
volume Interact with volumes
==639712==
==639712== Events : Ir
==639712== Collected : 619290724
==639712==
==639712== I refs: 619,290,724
I think my workaround cannot be treated as a bugfix. So, I provided patch-file here without making Merge Request. However, if you encounter the same problem, you can apply the patch and test the Callgrind behavior.
To fully resolve the issue, the next steps should focus on identifying why post_signal
is being called twice.