posix/arch.h: implement ARCH_EXCEPT() with abort() for debug convenience#68494
posix/arch.h: implement ARCH_EXCEPT() with abort() for debug convenience#68494marc-hb wants to merge 1 commit intozephyrproject-rtos:mainfrom
Conversation
There was a problem hiding this comment.
The second paragraph is true, but can be condensed to something like "ARCH_EXCEPT() can't possibly be used in user mode, and the syscall headers don't work this early".
682b2fe to
baa99ea
Compare
Flush all messages and invoke `abort()` when a k_panic() or k_oops() is hit in native_posix mode. One of the main purposes of `native_posix` is to provide debug convenience. When running in a debugger, `abort()` stops execution which provides a backtrace and the ability to inspect all variables. A good, sample use case is fuzzing failures in SOF, see an example in: thesofproject/sof#8632 In such a case, this commit adds value even before using a debugger. Without this commit, confusingly meaningless stack trace: ``` INFO: seed corpus: files: 1097 min: 1b max: 428b total: 90853b rss: 58Mb Exiting due to fatal error ==314134== ERROR: libFuzzer: fuzz target exited #0 0x81d9637 in __sanitizer_print_stack_trace (zephyr.exe+0x81d9637) #1 0x80cc42b in fuzzer::PrintStackTrace() (zephyr.exe+0x80cc42b) zephyrproject-rtos#2 0x80ab79e in fuzzer::Fuzzer::ExitCallback() FuzzerLoop.cpp.o zephyrproject-rtos#3 0x80ab864 in fuzzer::Fuzzer::StaticExitCallback() (zephyr.exe+ zephyrproject-rtos#4 0xf783dfe8 (/usr/lib32/libc.so.6+0x3dfe8) zephyrproject-rtos#5 0xf783e1e6 in exit (/usr/lib32/libc.so.6+0x3e1e6) zephyrproject-rtos#6 0x82a5488 in posix_exit boards/posix/native_posix/main.c:51:2 SUMMARY: libFuzzer: fuzz target exited ``` Thanks to this commit the `k_panic()` location is now immediately available in test logs without even running anything locally: ``` INFO: seed corpus: files: 1097 min: 1b max: 428b total: 90853b rss: 58Mb @ WEST_TOPDIR/sof/src/ipc/ipc3/handler.c:1623 ZEPHYR FATAL ERROR: 3 ==315176== ERROR: libFuzzer: deadly signal LLVMSymbolizer: error reading file: No such file or directory #0 0x81d9647 in __sanitizer_print_stack_trace (zephyr.exe+0x81d9647) #1 0x80cc43b in fuzzer::PrintStackTrace() (zephyr.exe+0x80cc43b) zephyrproject-rtos#2 0x80ab6be in fuzzer::Fuzzer::CrashCallback() FuzzerLoop.cpp.o zephyrproject-rtos#3 0x80ab77b in fuzzer::Fuzzer::StaticCrashSignalCallback() zephyrproject-rtos#4 0xf7f3159f (linux-gate.so.1+0x59f) zephyrproject-rtos#5 0xf7f31578 (linux-gate.so.1+0x578) zephyrproject-rtos#6 0xf788ea16 (/usr/lib32/libc.so.6+0x8ea16) zephyrproject-rtos#7 0xf783b316 in raise (/usr/lib32/libc.so.6+0x3b316) zephyrproject-rtos#8 0xf7822120 in abort (/usr/lib32/libc.so.6+0x22120) zephyrproject-rtos#9 0x82afbde in ipc_cmd src/ipc/ipc3/handler.c:1623:2 NOTE: libFuzzer has rudimentary signal handlers. Combine libFuzzer with AddressSanitizer or similar for better crash reports. SUMMARY: libFuzzer: deadly signal ``` The full stack trace is now immediately available when running zephyr.exe in gdb: ``` ./scripts/fuzz.sh -- -DEXTRA_CFLAGS="-O0 -g3" gdb build-fuzz/zephyr/zephyr.exe run backtrace zephyrproject-rtos#2 0xf783b317 in raise () from /usr/lib32/libc.so.6 zephyrproject-rtos#3 0xf7822121 in abort () from /usr/lib32/libc.so.6 zephyrproject-rtos#4 0x082afbdf in ipc_cmd (_hdr=0x8b...) at src/ipc/ipc3/handler.c:1623 zephyrproject-rtos#5 0x082fbf4b in ipc_platform_do_cmd (ipc=0x8b161c0) at src/platform/posix/ipc.c:162 zephyrproject-rtos#6 0x082e1e07 in ipc_do_cmd (data=0x8b161c0 <heapmem+1472>) at src/ipc/ipc-common.c:328 zephyrproject-rtos#7 0x083696aa in task_run (task=0x8b161e8 <heapmem+1512>) at zephyr/include/rtos/task.h:94 zephyrproject-rtos#8 0x083682dc in edf_work_handler (work=0x8b1621c <heapmem+1564>) at zephyr/edf_schedule.c:32 zephyrproject-rtos#9 0x085245af in work_queue_main (workq_ptr=0x8b15b00 <edf_workq>,...) at zephyr/kernel/work.c:688 zephyrproject-rtos#10 0x0823a6bc in z_thread_entry (entry=0x8523be0 <work_queue_main>,.. at zephyr/lib/os/thread_entry.c:48 zephyrproject-rtos#11 0x0829a6a1 in posix_arch_thread_entry (pa_thread_status=0x8630648 .. at zephyr/arch/posix/core/thread.c:56 zephyrproject-rtos#12 0x0829c043 in posix_thread_starter (arg=0x4) at zephyr/arch/posix/core/posix_core.c:293 zephyrproject-rtos#13 0x080f6041 in asan_thread_start(void*) () zephyrproject-rtos#14 0xf788c73c in ?? () from /usr/lib32/libc.so.6 ``` Signed-off-by: Marc Herbert <marc.herbert@intel.com>
There was a problem hiding this comment.
Thanks @marc-hb
There is quite many combinations here, so let me ask a couple of questions to be sure we are in the same page.
- Is your concern only when k_panic/ops() is called? (or also when abort() is called directly by somebody?)
- Are you concerned about a case in which CONFIG_COMMON_LIBC_ABORT is set in some other kconfig for native_posix? (note that when building with the host C library it is not selected by default, and it is not user selectable, so this combination should not be normally hit. It will be set by default when using native_sim with picolibc).
- For the debugger stopping on these cases, does CONFIG_ARCH_POSIX_TRAP_ON_FATAL do what you need
already?
One note about the logger dumping pending messages, it would seem the current path thru k_sys_fatal_error_handler() calls into LOG_PANIC().
So I guess logger handling in this PR is only due to implementing this "shorter" path?
About the stack trace, it is true that in general (and for most SW/Zephyr side errors), the call stack on the final exit() will not be too indicative as the native code has gone on a cleanup trip before calling exit(), but note that in general that cleanup trip is necessary, as all the ON_EXIT hooks are being called which allow for possible arch_posix aware test code to do whatever it needed to do on exit.
Note with these errors at least you get printed an error like:
@ ZEPHYR_BASE/<path_to_file>/<file>:<line>
Exiting due to fatal error
(This is not a rejection per se, just a "lets make sure we talk")
|
Thanks a lot for the expert and prompt feedback! Much appreciated.
Tell me about it! I didn't have any "concern", only one specific goal when I naively and ignorantly started to do this but I'm afraid I'm getting there :-)
I suspect we started on different... planets. Thanks again for trying to bring me closer. One "concerning" thing I just realized:
EDIT: in #ifdef ARCH_EXCEPT
/* This architecture has direct support for triggering a CPU exception */
#define z_except_reason(reason) ARCH_EXCEPT(reason)
#else
#define z_except_reason(reason)
...
z_fatal_error() // invokes k_sys_fatal_error_handler()
...
#endifEDIT2: answered below.
The former, see example in the commit message. I think k_panic() should not exit() but abort() by default on POSIX - no matter what LIBC is used. This is my only goal (for now). Except when testing k_panic() itself, when you have a k_panic() then you want to debug that panic. On POSIX, exit() is useless for debugging and abort() is very useful, Q.E.D. This is not just "practical". Conceptually, abort() is much closer to a "panic" than exit().
I didn't know about
My abort.c change was only to break the infinite loop that this PR introduces. Let's not worry about that yet, I suspect much bigger changes to this PR are coming... EDIT
Yes. Again: implementation detail. Let's not worry about that yet. |
Maybe @andyross wants to handle that sidetrack :)
If we enable it by default the program will just stop there, hard. For SOF, if you want you can just select that option from a SOF one. I guess that would cover this part of your need without needing to change anything in Zephyr? Regarding getting info about where the issue happened. If we take this example: diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index 2758d75d3f..daab32dba8 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -5,9 +5,11 @@
*/
#include <stdio.h>
+#include <zephyr/kernel.h>
int main(void)
{
printf("Hello World! %s\n", CONFIG_BOARD);
+ k_panic();
return 0;
}The output would be a short: $ zephyr/zephyr.exe
*** Booting Zephyr OS build v3.6.0-rc1-60-gb30d088d37ae ***
Hello World! native_sim
@ ZEPHYR_BASE/samples/hello_world/src/main.c:13
Exiting due to fatal errorSo you do get a relatively nice pointer. One could want something more detailed, and there I guess we'd go into how much should be provided by the native code, and how much by the tooling you use. If built with So I wonder if you'd think that is good enough, or you would like the native code itself to do something more like print out a relevant portion of the callstack on |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
I tested libFuzzer in more configurations and I confirmed these two observations: (1) libFuzzer invokes llvm-symbolizer and prints a stack trace on https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/fuzzer/FuzzerUtil.cpp#L207 I unfortunately could not find in the libFuzzer documentation any reason for these behaviors. @andyross maybe? There is none of these limitations when running the very same zephyr.exe with libFuzzer inside Obviously, the error handling design in Zephyr should not be completely based on libFuzzer's behavior. On the other hand, the difference between SIGBART and SIGTRAP may not be accidental and could reflect some standard practice? |
|
I'm no libfuzzer expert either. My guess is that libfuzzer doesn't trap on SIGTRAP because it wants to permit users to debug a fuzz-enabled binary (it might very well be using it internally too). Fuzzing is supposed to detect errors, and SIGTRAP isn't really an error, even though it can be handled. I don't know that this needs to that complicated. Just augmenting TRAP_ON_FATAL to set the signal delivered would be IMHO acceptable (maybe rename it to "RAISE_ON_FATAL" then I guess). Alternatively we could impelment a sys_fatal_error_handler() in the fuzzing rig, but that then gets fragile if some future layer wants to handle fatal errors out of a different context. Really what fuzzing wants is exactly what @marc-hb has here: we want the error to be immediate, synchronous, fatal to the Linux process, and trappable by libfuzzer. And we want it expanded as a macro and not part of the Zephyr API framework that might lose info about the location of the error. |
As discussed in the alternative approach zephyrproject-rtos/zephyr#68494, k_panic() in POSIX mode has various shortcomings that do not provide a useful trace. Useless pointers to signal handlers or other cleanup routines are printed instead. Leverage our already existing k_sys_fatal_error_handler() and dereference a NULL pointer there when in POSIX mode. This "fails fast" and provides a complete and relevant stack trace in CI when fuzzing or when using some other static analyzer. Example of how fuzzing failure thesofproject#8832 would have looked like in CI results thanks to this commit: ``` ./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each. Running: ./rballoc_align_fuzz_crash *** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 *** [00:00:00.000,000] <inf> main: SOF on native_posix [00:00:00.000,000] <inf> main: SOF initialized @ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391 [00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic [00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown) [00:00:00.000,000] <err> zephyr: Halting emulation AddressSanitizer:DEADLYSIGNAL ================================================================= ==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 ==1784402==The signal is caused by a WRITE memory access. ==1784402==Hint: address points to the zero page. #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19 #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3 thesofproject#2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16 thesofproject#3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11 thesofproject#4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11 thesofproject#5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8 thesofproject#6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9 thesofproject#7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2 thesofproject#8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9 thesofproject#9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9 thesofproject#10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16 thesofproject#11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3 thesofproject#12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2 ``` Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Answering myself: on hardware ARCH_EXCEPT triggers a hardware fault which invokes some registered Zephyr's error handling seems to have quite a few layers and but fatal.rst is very high-level and barely ever mentions the relationship between software exceptions, APIs and hardware exceptions.
On the other hand, the stack trace from ASAN (rung from LibFuzzer) looks great: https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/fuzzer/FuzzerUtil.cpp#L207 |
As discussed in the alternative approach zephyrproject-rtos/zephyr#68494, k_panic() in POSIX mode has various shortcomings that do not provide a useful trace. Useless pointers to signal handlers or other cleanup routines are printed instead. Leverage our already existing k_sys_fatal_error_handler() and dereference a NULL pointer there when in POSIX mode. This "fails fast" and provides a complete and relevant stack trace in CI when fuzzing or when using some other static analyzer. Example of how fuzzing failure thesofproject#8832 would have looked like in CI results thanks to this commit: ``` ./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each. Running: ./rballoc_align_fuzz_crash *** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 *** [00:00:00.000,000] <inf> main: SOF on native_posix [00:00:00.000,000] <inf> main: SOF initialized @ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391 [00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic [00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown) [00:00:00.000,000] <err> zephyr: Halting emulation AddressSanitizer:DEADLYSIGNAL ================================================================= ==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 ==1784402==The signal is caused by a WRITE memory access. ==1784402==Hint: address points to the zero page. #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19 #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3 thesofproject#2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16 thesofproject#3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11 thesofproject#4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11 thesofproject#5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8 thesofproject#6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9 thesofproject#7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2 thesofproject#8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9 thesofproject#9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9 thesofproject#10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16 thesofproject#11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3 thesofproject#12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2 ``` Signed-off-by: Marc Herbert <marc.herbert@intel.com>
As discussed in the alternative approach zephyrproject-rtos/zephyr#68494, k_panic() in POSIX mode has various shortcomings that do not provide a useful trace. Useless pointers to signal handlers or other cleanup routines are printed instead. Leverage our already existing k_sys_fatal_error_handler() and dereference a NULL pointer there when in POSIX mode. This "fails fast" and provides a complete and relevant stack trace in CI when fuzzing or when using some other static analyzer. Example of how fuzzing failure #8832 would have looked like in CI results thanks to this commit: ``` ./build-fuzz/zephyr/zephyr.exe: Running 1 inputs 1 time(s) each. Running: ./rballoc_align_fuzz_crash *** Booting Zephyr OS build zephyr-v3.5.0-3971-ge07de4e0a167 *** [00:00:00.000,000] <inf> main: SOF on native_posix [00:00:00.000,000] <inf> main: SOF initialized @ WEST_TOPDIR/sof/zephyr/lib/alloc.c:391 [00:00:00.000,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic [00:00:00.000,000] <err> os: Current thread: 0x891f8a0 (unknown) [00:00:00.000,000] <err> zephyr: Halting emulation AddressSanitizer:DEADLYSIGNAL ================================================================= ==1784402==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 ==1784402==The signal is caused by a WRITE memory access. ==1784402==Hint: address points to the zero page. #0 0x829a77d in k_sys_fatal_error_handler zephyr/wrapper.c:352:19 #1 0x829b8c0 in rballoc_align zephyr/lib/alloc.c:391:3 #2 0x8281438 in buffer_alloc src/audio/buffer.c:58:16 #3 0x826a60a in buffer_new src/ipc/ipc-helper.c:48:11 #4 0x8262107 in ipc_buffer_new src/ipc/ipc3/helper.c:459:11 #5 0x825944d in ipc_glb_tplg_buffer_new src/ipc/ipc3/handler.c:1305:8 #6 0x8257036 in ipc_cmd src/ipc/ipc3/handler.c:1651:9 #7 0x8272e59 in ipc_platform_do_cmd src/platform/posix/ipc.c:162:2 #8 0x826a2ac in ipc_do_cmd src/ipc/ipc-common.c:328:9 #9 0x829b0ab in task_run zephyr/include/rtos/task.h:94:9 #10 0x829abd8 in edf_work_handler zephyr/edf_schedule.c:32:16 #11 0x83560f7 in work_queue_main zephyr/kernel/work.c:688:3 #12 0x82244c2 in z_thread_entry zephyr/lib/os/thread_entry.c:48:2 ``` Signed-off-by: Marc Herbert <marc.herbert@intel.com>
I think this + fixing libFuzzer to print a complete stack trace would get this job done:
However libFuzzer is not actively developed any more (bug fixes only) and it would take a lot of time for the change to trickle down to Linux distributions and Github runners. https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/fuzzer/FuzzerUtil.cpp#L207 I found a much, much faster solution: dereferencing NULL in the app-specific It's quite unfortunate that it is app-specific but it does the job and I already spent way too much time on this - closing. |
Flush all messages and invoke
abort()when a k_panic() or k_oops() ishit in native_posix mode.
One of the main purposes of
native_posixis to provide debugconvenience. When running in a debugger,
abort()stops execution whichprovides a backtrace and the ability to inspect all variables.
A good, sample use case is fuzzing failures in SOF, see an example in:
thesofproject/sof#8632
In such a case, this commit adds value even before using a
debugger. Without this commit, confusingly meaningless stack trace:
Thanks to this commit the
k_panic()location is now immediatelyavailable in test logs without even running anything locally:
The full stack trace is now immediately available when running
zephyr.exe in gdb:
Signed-off-by: Marc Herbert marc.herbert@intel.com