Skip to content

Instantly share code, notes, and snippets.

@Toliak
Last active March 6, 2025 11:50
Show Gist options
  • Save Toliak/dc9d49424f2ce27cbc7a00b7de3dc2eb to your computer and use it in GitHub Desktop.
Save Toliak/dc9d49424f2ce27cbc7a00b7de3dc2eb to your computer and use it in GitHub Desktop.
Attachments to the Valgrind issue comment https://bugs.kde.org/show_bug.cgi?id=339160#c4

Hello. I encountered to the same problem while analyzing Nomad application via Callgrind tool.

Bug description

Valgrind versions tested:

  • valgrind-3.24.0 (Arch Linux, pacman) commit 2d09ef48e from the master branch.

Reproduction command and the resulting log:

# valgrind --tool=callgrind /tmp/nomad -h

...

Callgrind: threads.c:247 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.

host stacktrace:
==486603==    at 0x5801F79F: show_sched_status_wrk.lto_priv.0 (m_libcassert.c:407)
==486603==    by 0x5801F7FF: report_and_quit (m_libcassert.c:478)
==486603==    by 0x5801F992: vgPlain_assert_fail (m_libcassert.c:544)
==486603==    by 0x58023021: vgCallgrind_post_signal (threads.c:247)
==486603==    by 0x5807A256: vgSysWrap_amd64_linux_sys_rt_sigreturn_before (sigframe-amd64-linux.c:599)
==486603==    by 0x581DFD54: handle_syscall.isra.0 (syswrap-main.c:2358)
==486603==    by 0x5806956E: vgPlain_scheduler (scheduler.c:1582)
==486603==    by 0x580B30A9: run_a_thread_NORETURN (syswrap-linux.c:102)

Localization

I attempted to isolate the location of the bug and to determine the root cause. In my opinion, I achieved some noticeable results.

I have inserted additional debug prints (using VG_(printf)) into the functions listed below. Therefore, I am able to capture more necessary context. See attached file debug-prints-339160-commit-2d09ef48e.patch .

The callgrind output with additional debug print:

# ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h
==575331== Callgrind, a call-graph generating cache profiler
==575331== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==575331== Using Valgrind-3.25.0.GIT and LibVEX; rerun with -h for copyright info
==575331== Command: /tmp/nomad -h
==575331== 
==575331== For interactive control, run 'callgrind_control -h'.
BB# 4793874
-- pre_signal(TID 1, sig 23, alt_st yes)
=== pop_call_stack current_tid=1  CLG_(current_state).sig=23
=== run_post_signal_on_call_stack_bottom
-- post_signal(TID 1, sig 23)
=== post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=23
=== post_signal before exec_state_restore es->sig=0
=== sigframe_destroy before VG_TRACK( post_deliver_signal, tid, sigNo )
BB# 4793961
-- post_signal(TID 1, sig 23)
=== post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=0
0x000000000003dcd0

Callgrind: threads.c:262 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.

Two points to consider:

  1. There is the second post_signal call before the assertion fail. During the second call variable CLG_(current_state) appears reset, but the sigNum (post_signal argument) still retains old value "23".
  2. post_signal is called twice. The first trace is pop_call_stack -> run_post_signal_on_call_stack_bottom -> post_signal. And the second: sigframe_destroy -> track_post_deliver_signal -> post_signal.

I have not investigated this bug further, but based on the information above, I suppose that the signal is processed (called post_signal) twice, which is incorrect behavior. The first processing may be valid, but the second processing causes Callgrind to crash with an assertion error.

Workaround

I wrote a workaround in order to bypass the second post_signal (wrong one), that is being called from the sigframe_destroy. I added a new function, CLG_(post_signal_delivered), which wraps CLG_(post_signal) and checks is CLG_(current_state).sig equal to zero.

I have prepared a patch-file (tested only on commit 2d09ef48e). See file bug-339160-workaround-commit-2d09ef48e .

The results:

# ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h
...

Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Usage: nomad [-version] [-help] [-autocomplete-(un)install] <command> [args]

Common commands:
    run         Run a new job or update an existing job
    ......
    agent       Runs a Nomad agent

Other commands:
    acl                 Interact with ACL policies and tokens
    action              Run a pre-defined command from a given context
    ......
    version             Prints the Nomad version
    volume              Interact with volumes
==639712== 
==639712== Events    : Ir
==639712== Collected : 619290724
==639712== 
==639712== I   refs:      619,290,724

Validation and discussion

I think my workaround cannot be treated as a bugfix. So, I provided patch-file here without making Merge Request. However, if you encounter the same problem, you can apply the patch and test the Callgrind behavior.

To fully resolve the issue, the next steps should focus on identifying why post_signal is being called twice.

diff --git a/callgrind/global.h b/callgrind/global.h
index 4a988a481..655183c2d 100644
--- a/callgrind/global.h
+++ b/callgrind/global.h
@@ -773,6 +773,7 @@ void CLG_(init_exec_stack)(exec_stack*);
void CLG_(copy_current_exec_stack)(exec_stack*);
void CLG_(set_current_exec_stack)(exec_stack*);
void CLG_(pre_signal)(ThreadId tid, Int sigNum, Bool alt_stack);
+void CLG_(post_signal_delivered)(ThreadId tid, Int sigNum);
void CLG_(post_signal)(ThreadId tid, Int sigNum);
void CLG_(run_post_signal_on_call_stack_bottom)(void);
diff --git a/callgrind/main.c b/callgrind/main.c
index 39aa230ed..335d7da0b 100644
--- a/callgrind/main.c
+++ b/callgrind/main.c
@@ -2140,7 +2140,7 @@ void CLG_(pre_clo_init)(void)
VG_(track_start_client_code) ( & clg_start_client_code_callback );
VG_(track_pre_deliver_signal) ( & CLG_(pre_signal) );
- VG_(track_post_deliver_signal)( & CLG_(post_signal) );
+ VG_(track_post_deliver_signal)( & CLG_(post_signal_delivered) );
CLG_(set_clo_defaults)();
diff --git a/callgrind/threads.c b/callgrind/threads.c
index 624588a26..4425fcbe0 100644
--- a/callgrind/threads.c
+++ b/callgrind/threads.c
@@ -234,6 +234,17 @@ void CLG_(run_post_signal_on_call_stack_bottom)(void)
CLG_(post_signal)( CLG_(current_tid), CLG_(current_state).sig );
}
+void CLG_(post_signal_delivered)(ThreadId tid, Int sigNum)
+{
+ if (tid == CLG_(current_tid) && CLG_(current_state).sig == 0) {
+ // Already processed???
+ VG_(printf)("Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160\n");
+ return;
+ }
+
+ CLG_(post_signal)(tid, sigNum);
+}
+
void CLG_(post_signal)(ThreadId tid, Int sigNum)
{
exec_state* es;
diff --git a/callgrind/callstack.c b/callgrind/callstack.c
index fc1d6eeab..722511799 100644
--- a/callgrind/callstack.c
+++ b/callgrind/callstack.c
@@ -316,6 +316,7 @@ void CLG_(pop_call_stack)(void)
call_entry* lower_entry;
if (CLG_(current_state).sig >0) {
+ VG_(printf)("=== pop_call_stack current_tid=%u CLG_(current_state).sig=%u\n", CLG_(current_tid), CLG_(current_state).sig);
/* Check if we leave a signal handler; this can happen when
* calling longjmp() in the handler */
CLG_(run_post_signal_on_call_stack_bottom)();
diff --git a/callgrind/main.c b/callgrind/main.c
index 39aa230ed..0226468e7 100644
--- a/callgrind/main.c
+++ b/callgrind/main.c
@@ -1428,8 +1428,10 @@ static
void unwind_thread(thread_info* t)
{
/* unwind signal handlers */
- while(CLG_(current_state).sig !=0)
+ while(CLG_(current_state).sig !=0){
+ VG_(printf("=== unwind_thread loop inside\n"));
CLG_(post_signal)(CLG_(current_tid),CLG_(current_state).sig);
+ }
/* unwind regular call stack */
while(CLG_(current_call_stack).sp>0)
diff --git a/callgrind/threads.c b/callgrind/threads.c
index 624588a26..4e2d5e438 100644
--- a/callgrind/threads.c
+++ b/callgrind/threads.c
@@ -198,7 +198,7 @@ void CLG_(pre_signal)(ThreadId tid, Int sigNum, Bool alt_stack)
{
exec_state *es;
- CLG_DEBUG(0, ">> pre_signal(TID %u, sig %d, alt_st %s)\n",
+ CLG_DEBUG(-1, ">> pre_signal(TID %u, sig %d, alt_st %s)\n",
tid, sigNum, alt_stack ? "yes":"no");
/* switch to the thread the handler runs in */
@@ -226,6 +226,7 @@ void CLG_(pre_signal)(ThreadId tid, Int sigNum, Bool alt_stack)
*/
void CLG_(run_post_signal_on_call_stack_bottom)(void)
{
+ VG_(printf)("=== run_post_signal_on_call_stack_bottom\n");
exec_state* es = top_exec_state();
CLG_ASSERT(es != 0);
CLG_ASSERT(CLG_(current_state).sig >0);
@@ -239,11 +240,12 @@ void CLG_(post_signal)(ThreadId tid, Int sigNum)
exec_state* es;
UInt fn_number, *pactive;
- CLG_DEBUG(0, ">> post_signal(TID %u, sig %d)\n",
+ CLG_DEBUG(-1, ">> post_signal(TID %u, sig %d)\n",
tid, sigNum);
/* thread switching potentially needed, eg. with instrumentation off */
CLG_(switch_thread)(tid);
+ VG_(printf)("=== post_signal after CLG_(switch_thread)(tid); current_tid=%u tid=%u sigNum=%u CLG_(current_state).sig=%u\n", CLG_(current_tid), tid, sigNum, CLG_(current_state).sig);
CLG_ASSERT(sigNum == CLG_(current_state).sig);
/* Unwind call stack of this signal handler.
@@ -286,6 +288,7 @@ void CLG_(post_signal)(ThreadId tid, Int sigNum)
es->sig = -1;
current_states.sp--;
es = top_exec_state();
+ VG_(printf)("=== post_signal before exec_state_restore es->sig=%u\n", es->sig);
CLG_(current_state).sig = es->sig;
exec_state_restore();
diff --git a/coregrind/m_sigframe/sigframe-amd64-linux.c b/coregrind/m_sigframe/sigframe-amd64-linux.c
index 0e2430609..30333cd90 100644
--- a/coregrind/m_sigframe/sigframe-amd64-linux.c
+++ b/coregrind/m_sigframe/sigframe-amd64-linux.c
@@ -595,6 +595,7 @@ void VG_(sigframe_destroy)( ThreadId tid, Bool isRT )
"VG_(sigframe_destroy) (thread %u): isRT=%d valid magic; RIP=%#llx\n",
tid, isRT, tst->arch.vex.guest_RIP);
+ VG_(printf)("=== sigframe_destroy before VG_TRACK( post_deliver_signal, tid, sigNo )\n");
/* tell the tools */
VG_TRACK( post_deliver_signal, tid, sigNo );
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment