Skip to content

posix/arch.h: implement ARCH_EXCEPT() with abort() for debug convenience#68494

Closed
marc-hb wants to merge 1 commit intozephyrproject-rtos:mainfrom
marc-hb:posix-abort
Closed

posix/arch.h: implement ARCH_EXCEPT() with abort() for debug convenience#68494
marc-hb wants to merge 1 commit intozephyrproject-rtos:mainfrom
marc-hb:posix-abort

Conversation

@marc-hb
Copy link
Copy Markdown
Contributor

@marc-hb marc-hb commented Feb 2, 2024

Flush all messages and invoke abort() when a k_panic() or k_oops() is
hit in native_posix mode.

One of the main purposes of native_posix is to provide debug
convenience. When running in a debugger, abort() stops execution which
provides a backtrace and the ability to inspect all variables.

A good, sample use case is fuzzing failures in SOF, see an example in:
thesofproject/sof#8632

In such a case, this commit adds value even before using a
debugger. Without this commit, confusingly meaningless stack trace:

INFO: seed corpus: files: 1097 min: 1b max: 428b total: 90853b rss: 58Mb
Exiting due to fatal error
==314134== ERROR: libFuzzer: fuzz target exited
    #0 0x81d9637 in __sanitizer_print_stack_trace (zephyr.exe+0x81d9637)
    #1 0x80cc42b in fuzzer::PrintStackTrace() (zephyr.exe+0x80cc42b)
    #2 0x80ab79e in fuzzer::Fuzzer::ExitCallback() FuzzerLoop.cpp.o
    #3 0x80ab864 in fuzzer::Fuzzer::StaticExitCallback() (zephyr.exe+
    #4 0xf783dfe8  (/usr/lib32/libc.so.6+0x3dfe8)
    #5 0xf783e1e6 in exit (/usr/lib32/libc.so.6+0x3e1e6)
    #6 0x82a5488 in posix_exit boards/posix/native_posix/main.c:51:2

SUMMARY: libFuzzer: fuzz target exited

Thanks to this commit the k_panic() location is now immediately
available in test logs without even running anything locally:

INFO: seed corpus: files: 1097 min: 1b max: 428b total: 90853b rss: 58Mb
@ WEST_TOPDIR/sof/src/ipc/ipc3/handler.c:1623
ZEPHYR FATAL ERROR: 3
==315176== ERROR: libFuzzer: deadly signal
LLVMSymbolizer: error reading file: No such file or directory
    #0 0x81d9647 in __sanitizer_print_stack_trace (zephyr.exe+0x81d9647)
    #1 0x80cc43b in fuzzer::PrintStackTrace() (zephyr.exe+0x80cc43b)
    #2 0x80ab6be in fuzzer::Fuzzer::CrashCallback() FuzzerLoop.cpp.o
    #3 0x80ab77b in fuzzer::Fuzzer::StaticCrashSignalCallback()
    #4 0xf7f3159f  (linux-gate.so.1+0x59f)
    #5 0xf7f31578  (linux-gate.so.1+0x578)
    #6 0xf788ea16  (/usr/lib32/libc.so.6+0x8ea16)
    #7 0xf783b316 in raise (/usr/lib32/libc.so.6+0x3b316)
    #8 0xf7822120 in abort (/usr/lib32/libc.so.6+0x22120)
    #9 0x82afbde in ipc_cmd src/ipc/ipc3/handler.c:1623:2

NOTE: libFuzzer has rudimentary signal handlers.
      Combine libFuzzer with AddressSanitizer or similar for better
    crash reports.
SUMMARY: libFuzzer: deadly signal

The full stack trace is now immediately available when running
zephyr.exe in gdb:

./scripts/fuzz.sh  -- -DEXTRA_CFLAGS="-O0 -g3"

gdb build-fuzz/zephyr/zephyr.exe

run
backtrace

 #2  0xf783b317 in raise () from /usr/lib32/libc.so.6
 #3  0xf7822121 in abort () from /usr/lib32/libc.so.6
 #4  0x082afbdf in ipc_cmd (_hdr=0x8b...) at src/ipc/ipc3/handler.c:1623
 #5  0x082fbf4b in ipc_platform_do_cmd (ipc=0x8b161c0)
                                    at src/platform/posix/ipc.c:162
 #6  0x082e1e07 in ipc_do_cmd (data=0x8b161c0 <heapmem+1472>)
                                    at src/ipc/ipc-common.c:328
 #7  0x083696aa in task_run (task=0x8b161e8 <heapmem+1512>)
                                    at zephyr/include/rtos/task.h:94
 #8  0x083682dc in edf_work_handler (work=0x8b1621c <heapmem+1564>)
                                    at zephyr/edf_schedule.c:32
 #9  0x085245af in work_queue_main (workq_ptr=0x8b15b00 <edf_workq>,...)
                                         at zephyr/kernel/work.c:688
 #10 0x0823a6bc in z_thread_entry (entry=0x8523be0 <work_queue_main>,..
                                    at zephyr/lib/os/thread_entry.c:48
 #11 0x0829a6a1 in posix_arch_thread_entry (pa_thread_status=0x8630648 ..
                                  at zephyr/arch/posix/core/thread.c:56
 #12 0x0829c043 in posix_thread_starter (arg=0x4)
                              at zephyr/arch/posix/core/posix_core.c:293
 #13 0x080f6041 in asan_thread_start(void*) ()
 #14 0xf788c73c in ?? () from /usr/lib32/libc.so.6

Signed-off-by: Marc Herbert marc.herbert@intel.com

andyross
andyross previously approved these changes Feb 3, 2024
Comment thread include/zephyr/arch/posix/arch.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second paragraph is true, but can be condensed to something like "ARCH_EXCEPT() can't possibly be used in user mode, and the syscall headers don't work this early".

@marc-hb marc-hb force-pushed the posix-abort branch 3 times, most recently from 682b2fe to baa99ea Compare February 5, 2024 21:23
@nashif nashif changed the title arch/posix.h: implement ARCH_EXCEPT() with abort() for debug convenience posix/arch.h: implement ARCH_EXCEPT() with abort() for debug convenience Feb 5, 2024
@nashif nashif requested a review from aescolar February 5, 2024 21:54
Flush all messages and invoke `abort()` when a k_panic() or k_oops() is
hit in native_posix mode.

One of the main purposes of `native_posix` is to provide debug
convenience. When running in a debugger, `abort()` stops execution which
provides a backtrace and the ability to inspect all variables.

A good, sample use case is fuzzing failures in SOF, see an example in:
thesofproject/sof#8632

In such a case, this commit adds value even before using a
debugger. Without this commit, confusingly meaningless stack trace:

```
INFO: seed corpus: files: 1097 min: 1b max: 428b total: 90853b rss: 58Mb
Exiting due to fatal error
==314134== ERROR: libFuzzer: fuzz target exited
    #0 0x81d9637 in __sanitizer_print_stack_trace (zephyr.exe+0x81d9637)
    #1 0x80cc42b in fuzzer::PrintStackTrace() (zephyr.exe+0x80cc42b)
    zephyrproject-rtos#2 0x80ab79e in fuzzer::Fuzzer::ExitCallback() FuzzerLoop.cpp.o
    zephyrproject-rtos#3 0x80ab864 in fuzzer::Fuzzer::StaticExitCallback() (zephyr.exe+
    zephyrproject-rtos#4 0xf783dfe8  (/usr/lib32/libc.so.6+0x3dfe8)
    zephyrproject-rtos#5 0xf783e1e6 in exit (/usr/lib32/libc.so.6+0x3e1e6)
    zephyrproject-rtos#6 0x82a5488 in posix_exit boards/posix/native_posix/main.c:51:2

SUMMARY: libFuzzer: fuzz target exited
```

Thanks to this commit the `k_panic()` location is now immediately
available in test logs without even running anything locally:

```
INFO: seed corpus: files: 1097 min: 1b max: 428b total: 90853b rss: 58Mb
@ WEST_TOPDIR/sof/src/ipc/ipc3/handler.c:1623
ZEPHYR FATAL ERROR: 3
==315176== ERROR: libFuzzer: deadly signal
LLVMSymbolizer: error reading file: No such file or directory
    #0 0x81d9647 in __sanitizer_print_stack_trace (zephyr.exe+0x81d9647)
    #1 0x80cc43b in fuzzer::PrintStackTrace() (zephyr.exe+0x80cc43b)
    zephyrproject-rtos#2 0x80ab6be in fuzzer::Fuzzer::CrashCallback() FuzzerLoop.cpp.o
    zephyrproject-rtos#3 0x80ab77b in fuzzer::Fuzzer::StaticCrashSignalCallback()
    zephyrproject-rtos#4 0xf7f3159f  (linux-gate.so.1+0x59f)
    zephyrproject-rtos#5 0xf7f31578  (linux-gate.so.1+0x578)
    zephyrproject-rtos#6 0xf788ea16  (/usr/lib32/libc.so.6+0x8ea16)
    zephyrproject-rtos#7 0xf783b316 in raise (/usr/lib32/libc.so.6+0x3b316)
    zephyrproject-rtos#8 0xf7822120 in abort (/usr/lib32/libc.so.6+0x22120)
    zephyrproject-rtos#9 0x82afbde in ipc_cmd src/ipc/ipc3/handler.c:1623:2

NOTE: libFuzzer has rudimentary signal handlers.
      Combine libFuzzer with AddressSanitizer or similar for better
    crash reports.
SUMMARY: libFuzzer: deadly signal
```

The full stack trace is now immediately available when running
zephyr.exe in gdb:

```
./scripts/fuzz.sh  -- -DEXTRA_CFLAGS="-O0 -g3"

gdb build-fuzz/zephyr/zephyr.exe

run
backtrace

 zephyrproject-rtos#2  0xf783b317 in raise () from /usr/lib32/libc.so.6
 zephyrproject-rtos#3  0xf7822121 in abort () from /usr/lib32/libc.so.6
 zephyrproject-rtos#4  0x082afbdf in ipc_cmd (_hdr=0x8b...) at src/ipc/ipc3/handler.c:1623
 zephyrproject-rtos#5  0x082fbf4b in ipc_platform_do_cmd (ipc=0x8b161c0)
                                    at src/platform/posix/ipc.c:162
 zephyrproject-rtos#6  0x082e1e07 in ipc_do_cmd (data=0x8b161c0 <heapmem+1472>)
                                    at src/ipc/ipc-common.c:328
 zephyrproject-rtos#7  0x083696aa in task_run (task=0x8b161e8 <heapmem+1512>)
                                    at zephyr/include/rtos/task.h:94
 zephyrproject-rtos#8  0x083682dc in edf_work_handler (work=0x8b1621c <heapmem+1564>)
                                    at zephyr/edf_schedule.c:32
 zephyrproject-rtos#9  0x085245af in work_queue_main (workq_ptr=0x8b15b00 <edf_workq>,...)
                                         at zephyr/kernel/work.c:688
 zephyrproject-rtos#10 0x0823a6bc in z_thread_entry (entry=0x8523be0 <work_queue_main>,..
                                    at zephyr/lib/os/thread_entry.c:48
 zephyrproject-rtos#11 0x0829a6a1 in posix_arch_thread_entry (pa_thread_status=0x8630648 ..
                                  at zephyr/arch/posix/core/thread.c:56
 zephyrproject-rtos#12 0x0829c043 in posix_thread_starter (arg=0x4)
                              at zephyr/arch/posix/core/posix_core.c:293
 zephyrproject-rtos#13 0x080f6041 in asan_thread_start(void*) ()
 zephyrproject-rtos#14 0xf788c73c in ?? () from /usr/lib32/libc.so.6
```

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb marc-hb marked this pull request as ready for review February 6, 2024 00:32
@zephyrbot zephyrbot added area: C Library C Standard Library area: native port Host native arch port (native_sim) area: Architectures labels Feb 6, 2024
@marc-hb marc-hb requested review from ceolin and teburd February 6, 2024 02:17
@aescolar aescolar assigned aescolar and unassigned stephanosio Feb 6, 2024
Copy link
Copy Markdown
Member

@aescolar aescolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @marc-hb
There is quite many combinations here, so let me ask a couple of questions to be sure we are in the same page.

  • Is your concern only when k_panic/ops() is called? (or also when abort() is called directly by somebody?)
  • Are you concerned about a case in which CONFIG_COMMON_LIBC_ABORT is set in some other kconfig for native_posix? (note that when building with the host C library it is not selected by default, and it is not user selectable, so this combination should not be normally hit. It will be set by default when using native_sim with picolibc).
  • For the debugger stopping on these cases, does CONFIG_ARCH_POSIX_TRAP_ON_FATAL do what you need
    already?

One note about the logger dumping pending messages, it would seem the current path thru k_sys_fatal_error_handler() calls into LOG_PANIC().
So I guess logger handling in this PR is only due to implementing this "shorter" path?

About the stack trace, it is true that in general (and for most SW/Zephyr side errors), the call stack on the final exit() will not be too indicative as the native code has gone on a cleanup trip before calling exit(), but note that in general that cleanup trip is necessary, as all the ON_EXIT hooks are being called which allow for possible arch_posix aware test code to do whatever it needed to do on exit.
Note with these errors at least you get printed an error like:

@ ZEPHYR_BASE/<path_to_file>/<file>:<line>
Exiting due to fatal error

(This is not a rejection per se, just a "lets make sure we talk")

@marc-hb
Copy link
Copy Markdown
Contributor Author

marc-hb commented Feb 7, 2024

Thanks a lot for the expert and prompt feedback! Much appreciated.

There is quite many combinations here,

Tell me about it! I didn't have any "concern", only one specific goal when I naively and ignorantly started to do this but I'm afraid I'm getting there :-)

to be sure we are in the same page.

I suspect we started on different... planets. Thanks again for trying to bring me closer.

One "concerning" thing I just realized:
https://docs.zephyrproject.org/latest/kernel/services/other/fatal.html#kernel-oops

A kernel error is a software triggered fatal error invoked by k_panic(). This should be used to indicate that the Zephyr kernel is in an unrecoverable state. Implementations of k_sys_fatal_error_handler() should not return if the kernel encounters a panic condition ...

This documentation does not seem to match the current code. k_panic() invokes k_sys_fatal_error_handler() ONLY when ARCH_EXCEPT() is missing. Pretty much every architecture outside POSIX defines ARCH_EXCEPT()! So this page is valid only for POSIX? Does not really make sense...

EDIT: in kernel.h:

#ifdef ARCH_EXCEPT
/* This architecture has direct support for triggering a CPU exception */
#define z_except_reason(reason)	ARCH_EXCEPT(reason)
#else
#define z_except_reason(reason) 
     ...
     z_fatal_error() // invokes k_sys_fatal_error_handler()
     ...
#endif

EDIT2: answered below.

Is your concern only when k_panic/ops() is called? (or also when abort() is called directly by somebody?)

The former, see example in the commit message. I think k_panic() should not exit() but abort() by default on POSIX - no matter what LIBC is used. This is my only goal (for now).

Except when testing k_panic() itself, when you have a k_panic() then you want to debug that panic. On POSIX, exit() is useless for debugging and abort() is very useful, Q.E.D.

This is not just "practical". Conceptually, abort() is much closer to a "panic" than exit().

For the debugger stopping on these cases, does CONFIG_ARCH_POSIX_TRAP_ON_FATAL do what you need already?

I didn't know about CONFIG_ARCH_POSIX_TRAP_ON_FATAL, thanks. It does solve the problem when running in a debugger BUT:

  • it's not enabled by default (why not?) and most people won't know this even exists. I sure didn't.
  • Unlike this PR, CONFIG_ARCH_POSIX_TRAP_ON_FATAL does not print anything useful, which makes it useless outside a debugger (e.g.: in CI)

Are you concerned about a case in which CONFIG_COMMON_LIBC_ABORT is ...

My abort.c change was only to break the infinite loop that this PR introduces. Let's not worry about that yet, I suspect much bigger changes to this PR are coming...

EDIT

So I guess logger handling in this PR is only due to implementing this "shorter" path?

Yes. Again: implementation detail. Let's not worry about that yet.

@marc-hb marc-hb marked this pull request as draft February 8, 2024 00:56
@aescolar
Copy link
Copy Markdown
Member

aescolar commented Feb 8, 2024

This documentation does not seem to match the current code

Maybe @andyross wants to handle that sidetrack :)

I didn't know about CONFIG_ARCH_POSIX_TRAP_ON_FATAL,.. BUT .. it's not enabled by default (why not?)

If we enable it by default the program will just stop there, hard.
Note that we have several native targets, and that in general a Zephyr kernel panic is something the native runner/posix_arch aware code continues quite gracefully from. So, for example, in multidevice and multimcu simulations, we have some degree of notification and cleanup (and file buffers are flushed and so forth).

For SOF, if you want you can just select that option from a SOF one. I guess that would cover this part of your need without needing to change anything in Zephyr?

Regarding getting info about where the issue happened. If we take this example:

diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index 2758d75d3f..daab32dba8 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -5,9 +5,11 @@
  */
 
 #include <stdio.h>
+#include <zephyr/kernel.h>
 
 int main(void)
 {
        printf("Hello World! %s\n", CONFIG_BOARD);
+       k_panic();
        return 0;
 }

The output would be a short:

$ zephyr/zephyr.exe
*** Booting Zephyr OS build v3.6.0-rc1-60-gb30d088d37ae ***
Hello World! native_sim
@ ZEPHYR_BASE/samples/hello_world/src/main.c:13
Exiting due to fatal error

So you do get a relatively nice pointer. One could want something more detailed, and there I guess we'd go into how much should be provided by the native code, and how much by the tooling you use.

If built with CONFIG_ARCH_POSIX_TRAP_ON_FATAL=y run in gdb and check the backtrace where it traps, you see a proper callstack. Also valgrind will provide you a nice callstack out of the box in this case:

==1140916== Process terminating with default action of signal 5 (SIGTRAP)
==1140916==    at 0x4204AA7: __pthread_kill_implementation (pthread_kill.c:44)
==1140916==    by 0x41B3684: raise (raise.c:26)
==1140916==    by 0x804E392: nsi_raise_sigtrap (fatal_trap.c:11)
==1140916==    by 0x8049F62: arch_system_halt (fatal.c:23)
==1140916==    by 0x804A600: k_sys_fatal_error_handler (fatal.c:44)
==1140916==    by 0x804A62B: z_fatal_error (fatal.c:118)
==1140916==    by 0x8049476: main (main.c:13)

So I wonder if you'd think that is good enough, or you would like the native code itself to do something more like print out a relevant portion of the callstack on arch_system_halt() or so..

@marc-hb

This comment was marked as resolved.

@aescolar

This comment was marked as resolved.

@marc-hb
Copy link
Copy Markdown
Contributor Author

marc-hb commented Feb 8, 2024

I tested libFuzzer in more configurations and I confirmed these two observations:

(1) libFuzzer invokes llvm-symbolizer and prints a stack trace on SIGABRT but not on SIGTRAP
(2) when there is one, the stack trace printed by LibFuzzer never goes beyond the first zephyr.exe frame?! That's why the ARCH_EXCEPT() solution is more useful: because it's macro-based. Showing that zephyr stopped in the nsi_raise_sigtrap() frame but nothing further is obviously useless.

https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/fuzzer/FuzzerUtil.cpp#L207

I unfortunately could not find in the libFuzzer documentation any reason for these behaviors. @andyross maybe?

There is none of these limitations when running the very same zephyr.exe with libFuzzer inside gdb; this always provides a complete stack trace with either SIGABRT and SIGTRAP (but not on exit())

Obviously, the error handling design in Zephyr should not be completely based on libFuzzer's behavior. On the other hand, the difference between SIGBART and SIGTRAP may not be accidental and could reflect some standard practice?

@andyross
Copy link
Copy Markdown
Contributor

andyross commented Feb 9, 2024

I'm no libfuzzer expert either. My guess is that libfuzzer doesn't trap on SIGTRAP because it wants to permit users to debug a fuzz-enabled binary (it might very well be using it internally too). Fuzzing is supposed to detect errors, and SIGTRAP isn't really an error, even though it can be handled.

I don't know that this needs to that complicated. Just augmenting TRAP_ON_FATAL to set the signal delivered would be IMHO acceptable (maybe rename it to "RAISE_ON_FATAL" then I guess). Alternatively we could impelment a sys_fatal_error_handler() in the fuzzing rig, but that then gets fragile if some future layer wants to handle fatal errors out of a different context.

Really what fuzzing wants is exactly what @marc-hb has here: we want the error to be immediate, synchronous, fatal to the Linux process, and trappable by libfuzzer. And we want it expanded as a macro and not part of the Zephyr API framework that might lose info about the location of the error.

marc-hb added a commit to marc-hb/sof that referenced this pull request Feb 27, 2024
As discussed in the alternative approach
zephyrproject-rtos/zephyr#68494, k_panic() in
POSIX mode has various shortcomings that do not provide a useful
trace. Useless pointers to signal handlers or other cleanup routines are
printed instead. Leverage our already existing
k_sys_fatal_error_handler() and dereference a NULL pointer there when in
POSIX mode. This "fails fast" and provides a complete and relevant stack
trace in CI when fuzzing or when using some other static
analyzer. Example of how fuzzing failure thesofproject#8832 would have looked like in
CI results thanks to this commit:

```
./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each.
Running: ./rballoc_align_fuzz_crash
*** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 ***
[00:00:00.000,000] <inf> main: SOF on native_posix
[00:00:00.000,000] <inf> main: SOF initialized
@ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391
[00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic
[00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown)
[00:00:00.000,000] <err> zephyr: Halting emulation
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000
==1784402==The signal is caused by a WRITE memory access.
==1784402==Hint: address points to the zero page.
    #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19
    #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3
    thesofproject#2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16
    thesofproject#3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11
    thesofproject#4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11
    thesofproject#5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8
    thesofproject#6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9
    thesofproject#7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2
    thesofproject#8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9
    thesofproject#9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9
    thesofproject#10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16
    thesofproject#11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3
    thesofproject#12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2
```

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Copy Markdown
Contributor Author

marc-hb commented Feb 27, 2024

This documentation does not seem to match the current code. k_panic() invokes k_sys_fatal_error_handler() ONLY when ARCH_EXCEPT() is missing. Pretty much every architecture outside POSIX defines ARCH_EXCEPT()! So this page is valid only for POSIX? Does not really make sense...

Answering myself: on hardware ARCH_EXCEPT triggers a hardware fault which invokes some registered z_${arch}_fatal_error() handler which invokes k_sys_fatal_error_handler(). That's the connection I was missing.

Zephyr's error handling seems to have quite a few layers and but fatal.rst is very high-level and barely ever mentions the relationship between software exceptions, APIs and hardware exceptions.

(2) when there is one, the stack trace printed by LibFuzzer never goes beyond the first zephyr.exe frame?!

On the other hand, the stack trace from ASAN (rung from LibFuzzer) looks great:

https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/fuzzer/FuzzerUtil.cpp#L207

marc-hb added a commit to marc-hb/sof that referenced this pull request Feb 27, 2024
As discussed in the alternative approach
zephyrproject-rtos/zephyr#68494, k_panic() in
POSIX mode has various shortcomings that do not provide a useful
trace. Useless pointers to signal handlers or other cleanup routines are
printed instead. Leverage our already existing
k_sys_fatal_error_handler() and dereference a NULL pointer there when in
POSIX mode. This "fails fast" and provides a complete and relevant stack
trace in CI when fuzzing or when using some other static
analyzer. Example of how fuzzing failure thesofproject#8832 would have looked like in
CI results thanks to this commit:

```
./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each.
Running: ./rballoc_align_fuzz_crash
*** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 ***
[00:00:00.000,000] <inf> main: SOF on native_posix
[00:00:00.000,000] <inf> main: SOF initialized
@ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391
[00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic
[00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown)
[00:00:00.000,000] <err> zephyr: Halting emulation
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000
==1784402==The signal is caused by a WRITE memory access.
==1784402==Hint: address points to the zero page.
    #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19
    #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3
    thesofproject#2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16
    thesofproject#3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11
    thesofproject#4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11
    thesofproject#5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8
    thesofproject#6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9
    thesofproject#7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2
    thesofproject#8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9
    thesofproject#9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9
    thesofproject#10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16
    thesofproject#11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3
    thesofproject#12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2
```

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb marc-hb requested a review from GRobertZieba February 27, 2024 23:53
lgirdwood pushed a commit to thesofproject/sof that referenced this pull request Feb 28, 2024
As discussed in the alternative approach
zephyrproject-rtos/zephyr#68494, k_panic() in
POSIX mode has various shortcomings that do not provide a useful
trace. Useless pointers to signal handlers or other cleanup routines are
printed instead. Leverage our already existing
k_sys_fatal_error_handler() and dereference a NULL pointer there when in
POSIX mode. This "fails fast" and provides a complete and relevant stack
trace in CI when fuzzing or when using some other static
analyzer. Example of how fuzzing failure #8832 would have looked like in
CI results thanks to this commit:

```
./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each.
Running: ./rballoc_align_fuzz_crash
*** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 ***
[00:00:00.000,000] <inf> main: SOF on native_posix
[00:00:00.000,000] <inf> main: SOF initialized
@ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391
[00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic
[00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown)
[00:00:00.000,000] <err> zephyr: Halting emulation
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000
==1784402==The signal is caused by a WRITE memory access.
==1784402==Hint: address points to the zero page.
    #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19
    #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3
    #2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16
    #3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11
    #4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11
    #5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8
    #6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9
    #7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2
    #8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9
    #9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9
    #10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16
    #11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3
    #12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2
```

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Copy Markdown
Contributor Author

marc-hb commented Feb 28, 2024

Just augmenting TRAP_ON_FATAL to set the signal delivered would be IMHO acceptable (maybe rename it to "RAISE_ON_FATAL" then I guess)

I think this + fixing libFuzzer to print a complete stack trace would get this job done:

we want the error to be immediate, synchronous, fatal to the Linux process, and trappable by libfuzzer.

However libFuzzer is not actively developed any more (bug fixes only) and it would take a lot of time for the change to trickle down to Linux distributions and Github runners.

https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/fuzzer/FuzzerUtil.cpp#L207

I found a much, much faster solution: dereferencing NULL in the app-specific k_sys_fatal_error_handler(). Just merged in thesofproject/sof@884305b

It's quite unfortunate that it is app-specific but it does the job and I already spent way too much time on this - closing.

@marc-hb marc-hb closed this Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Architectures area: C Library C Standard Library area: native port Host native arch port (native_sim)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants