
Update madrona upstream #248

Merged
daphne-cornelisse merged 15 commits into dev_kshot from ap_upstreanm
Apr 15, 2025

Conversation

@aaravpandya

Collaborator

No description provided.

@aaravpandya

Collaborator Author

Hi @shacklettbp,
After upgrading to the latest commit of Madrona, I am unable to run the viewer. Everything else is working as expected. When running the viewer, I get the following error during WindowManager initialization:

(lldb) next
Process 5870 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000100001f48 viewer`main(argc=2, argv=0x000000016fdff348) at viewer.cpp:69:19
   66  	        true;
   67  	#endif
   68  	
-> 69  	    WindowManager wm {};
   70  	    WindowHandle window = wm.makeWindow("GPUDrive", 1920, 1080);
   71  	    render::GPUHandle render_gpu = wm.initGPU(0, { window.get() });
   72  	
Target 0: (viewer) stopped.
(lldb) next
Process 5870 stopped
* thread #2, stop reason = breakpoint 2.3
    frame #0: 0x000000018b890648 Foundation`-[NSThread main]
Foundation`-[NSThread main]:
->  0x18b890648 <+0>:  pacibsp 
    0x18b89064c <+4>:  stp    x29, x30, [sp, #-0x10]!
    0x18b890650 <+8>:  mov    x29, sp
    0x18b890654 <+12>: ldr    x8, [x0, #0x8]
Target 0: (viewer) stopped.
(lldb) continue
Process 5870 resuming
OOM: 42
Process 5870 exited with status = 1 (0x00000001) 

I am getting this error on macOS, but I also see it on Linux.

@aaravpandya

Collaborator Author

Also, I am unable to get the headless binary to run on CUDA.

(base) aarav@emerge2-desktop:~/gpudrive/build$ ./headless CUDA 1
Compiler Flags:
-I/home/aarav/gpudrive/external/madrona/src/mw/device/include
-I/home/aarav/gpudrive/external/madrona/src/common/../../include
-I/usr/local/cuda-12.8/targets/x86_64-linux/include
-std=c++20
-default-device
-rdc=true
-use_fast_math
-DMADRONA_GPU_MODE=1
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP
-DCCCL_DISABLE_BF16_SUPPORT=1
-DCUB_DISABLE_BF16_SUPPORT=1
-arch
sm_89
-DMADRONA_MWGPU_NUM_SMS=(76_i32)
-DMADRONA_MWGPU_MAX_BLOCKS_PER_SM=(1_i32)
-dopt=on
--extra-device-vectorization
-lineinfo
-dlto
-DMADRONA_MWGPU_LTO_MODE=1
-DMADRONA_MWGPU_TASKGRAPH=1

Linker Flags:
-arch=sm_89
-ftz=1
-prec-div=0
-prec-sqrt=0
-fma=1
-optimize-unused-variables
-lineinfo
-lto
-verbose

Compiling GPU engine code:
/home/aarav/gpudrive/external/madrona/src/mw/device/memory.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/state.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/crash.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/consts.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/taskgraph.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/taskgraph_utils.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/sort_archetype.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/host_print.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../common/hashmap.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../common/navmesh.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../core/base.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/physics.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/geo.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/xpbd.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/tgs.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/narrowphase.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/broadphase.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../render/ecs_system.cpp
/home/aarav/gpudrive/src/sim.cpp
/home/aarav/gpudrive/src/level_gen.cpp
/home/aarav/gpudrive/src/level_gen.cpp(280): warning #177-D: function "gpudrive::createFloorPlane" was declared but never referenced
  static void createFloorPlane(Engine &ctx)
              ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"



CUDA linking Failed!
/home/aarav/gpudrive/src/level_gen.cpp: link error: error: Linking globals named 'initBVHParams': symbol multiply defined!

ERROR 9 in nvvmCompileProgram callback


Error at /home/aarav/gpudrive/external/madrona/src/mw/cuda_exec.cpp:703 in auto madrona::compileCode(const char **, int64_t, const char **, int64_t, const char **, int64_t, const char **, int64_t, const MegakernelConfig *, int64_t, CompileConfig::OptMode, ExecutorMode, bool)::(anonymous class)::operator()(nvJitLinkResult) const
nvJitLink error: NVVM compilation error
Aborted (core dumped)

@shacklettbp

Collaborator

@aaravpandya I checked out your branch and I get a segfault in loadPhysicsObjects when running ./headless CUDA 1. Am I missing some files or is this another bug?

@shacklettbp

Collaborator

Can you also give me the command to run the viewer?

@aaravpandya

Collaborator Author

It looks like another bug. I have been getting that sporadically too while running on CPU.

To run the viewer, it's simply ./viewer. To run on CUDA, it's ./viewer 1 --cuda. I need to rework the number-of-worlds param here. We determine the number of worlds from the number of maps provided. The path to an example map is hardcoded.

@shacklettbp

Collaborator

Ok, I think I fixed the loadPhysicsObjects bug, it was a Madrona issue.

By the way for future reference:

    // Define the texture paths
    const char *texture_paths[] = {
        (std::filesystem::path(DATA_DIR) / "green_grid.png").string().c_str(),
        (std::filesystem::path(DATA_DIR) / "smile.png").string().c_str()
    };

This is unsafe because each std::string (the return value of .string()) is a temporary that is destroyed as soon as this statement finishes, leaving the const char * pointers dangling. Unfortunately, you need to put the std::strings in an array that keeps them in scope and then have a separate array of const char * pointing into that array (I'll push this fix).
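
For readers following along, here is a minimal sketch of the pattern described above. It is not the commit that was pushed. The `TexturePaths` struct name and the fallback `DATA_DIR` definition are assumptions made only so the example compiles on its own; the file names come from the snippet above.

```cpp
#include <array>
#include <cstddef>
#include <filesystem>
#include <string>

// Hypothetical fallback so the sketch compiles standalone; the real
// project is assumed to define DATA_DIR in its build system.
#ifndef DATA_DIR
#define DATA_DIR "data"
#endif

// Hypothetical helper: owns the path strings and exposes C-string
// pointers into them. The pointers are valid while this object lives.
struct TexturePaths {
    std::array<std::string, 2> storage;   // keeps the strings alive
    std::array<const char *, 2> c_strs;   // points into `storage`

    TexturePaths()
        : storage {
              (std::filesystem::path(DATA_DIR) / "green_grid.png").string(),
              (std::filesystem::path(DATA_DIR) / "smile.png").string(),
          },
          c_strs {}
    {
        for (std::size_t i = 0; i < storage.size(); i++) {
            c_strs[i] = storage[i].c_str();
        }
    }

    // A copy would carry pointers into the original object's strings,
    // so copying (and, implicitly, moving) is disabled.
    TexturePaths(const TexturePaths &) = delete;
    TexturePaths &operator=(const TexturePaths &) = delete;
};
```

A caller would declare `TexturePaths paths;` in the scope that uses the textures and pass `paths.c_strs.data()` wherever a `const char **` is expected, keeping `paths` alive until the consumer is done with the pointers.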

@shacklettbp

Collaborator

@aaravpandya After my latest commit this branch works for me now. Let me know if you run into any problems.

The one thing I notice is that when simulating multiple worlds, only the first world shows anything in the viewer, on both CPU and GPU backends. Has the viewer always behaved this way in GPUDrive? This isn't the case for our other environments.

@aaravpandya

Collaborator Author

Thank you so much @shacklettbp. I see that it now works on CUDA and CPU. I was not able to fully test the viewer on Ubuntu, but @daphne-cornelisse ran it and it got past the earlier OOM error.

However, on Mac, I am still getting the OOM error. I don't think I am out of memory because I have 48 GB of RAM.

Regarding the viewer, I believe there was a drop-down where we can select the world we want to view. I was actually going to ask whether you could provide us with some release notes or documentation on the new changes and features in Madrona that we can use. For example, I see that Madrona has a new batch renderer and some new named tensor interfaces. We would like to know how we can use them.

@shacklettbp

shacklettbp commented Feb 13, 2025

Collaborator

What is the full backtrace for the error you're getting on macOS? It works on my MacBook.

Right, when I select anything other than world 0 in that drop-down in the viewer, the world is blank (no entities).

There are no release notes and minimal documentation, unfortunately. We added the batch renderer as part of a SIGGRAPH Asia paper: https://madrona-engine.github.io/renderer.html. If you're interested in using it, it would probably make sense to set up a Zoom call with everyone (me, anyone on your side, and Luc, the first author on the batch renderer paper).

The named tensor thing is a kind of TensorDict-like interface; I'm using it for JAX interop currently. I don't think it's useful for you guys in its current state. Are there any features you need in that space?

@aaravpandya

Collaborator Author

So I get this same backtrace as before. I made sure I am on the latest commit btw and all submodules are updated. It fails at creating the WindowManager object.

(lldb) 
Process 98495 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000100001f48 viewer`main(argc=2, argv=0x000000016fdff348) at viewer.cpp:69:19
   66  	        true;
   67  	#endif
   68  	
-> 69  	    WindowManager wm {};
   70  	    WindowHandle window = wm.makeWindow("GPUDrive", 1920, 1080);
   71  	    render::GPUHandle render_gpu = wm.initGPU(0, { window.get() });
   72  	
Target 0: (viewer) stopped.
(lldb) next
Process 98495 stopped
* thread #2, stop reason = breakpoint 1.11
    frame #0: 0x000000018512c648 Foundation`-[NSThread main]
Foundation`-[NSThread main]:
->  0x18512c648 <+0>:  pacibsp 
    0x18512c64c <+4>:  stp    x29, x30, [sp, #-0x10]!
    0x18512c650 <+8>:  mov    x29, sp
    0x18512c654 <+12>: ldr    x8, [x0, #0x8]
Target 0: (viewer) stopped.
(lldb) continue
Process 98495 resuming
OOM: 42
Process 98495 exited with status = 1 (0x00000001) 

Perhaps I am on an incompatible Xcode or macOS version? I am on Xcode 16.2 and macOS 15.0 Sequoia.

@shacklettbp

Collaborator

I see. Can you run it again under the debugger with the latest commit and submodule update and give me the debugger backtrace? You should be able to get a proper backtrace now.

@aaravpandya

Collaborator Author

Hi @shacklettbp, sorry, I got distracted and didn't work on this PR further.
I gave it a try and still don't see a backtrace:

(lldb) br s -f /Users/aaravpandya/dev/gpudrive/external/madrona/src/common/op_new_delete.cpp -l 37
Breakpoint 2: where = libmadrona_std_mem.dylib`madrona::(anonymous namespace)::opNewAlignImpl(unsigned long, std::align_val_t) + 96 at op_new_delete.cpp:37:17, address = 0x00000001006f0f30
(lldb) run
Process 92194 launched: '/Users/aaravpandya/dev/gpudrive/build/viewer' (arm64)
OOM: 42
Process 92194 exited with status = 1 (0x00000001) 
(lldb) br s -f /Users/aaravpandya/dev/gpudrive/external/madrona/src/common/op_new_delete.cpp -l 14
Breakpoint 3: where = libmadrona_std_mem.dylib`madrona::(anonymous namespace)::opNewImpl(unsigned long) + 52 at op_new_delete.cpp:14:17, address = 0x00000001006f0d38
(lldb) run
Process 92198 launched: '/Users/aaravpandya/dev/gpudrive/build/viewer' (arm64)
OOM: 42
Process 92198 exited with status = 1 (0x00000001) 
(lldb) bt
error: Command requires a process which is currently stopped.

Also, I see you have removed the sorting code from the taskgraph. Do we no longer need to sort archetypes manually? Is it handled inside Madrona now?

Thanks

@aaravpandya

aaravpandya commented Apr 12, 2025

Collaborator Author

This is very confusing! I removed the fprintf for the OOM statement, and it still prints it. When I rebuild, I can see that the change is being recompiled:

(gpudrive) (base) ➜  build git:(ap_upstreanm) ✗ make -j12
[  0%] Built target madrona_vk_loader
[  1%] Built target madrona_moltenvk_lib
[  1%] Built target madrona_embree_lib
[  1%] Built target madrona_libdxc_shlib
[  2%] Built target generate_vk_dispatch
[  5%] Built target madrona_python_utils
[  5%] Built target madrona_mem
[  7%] Built target madrona_err
[  9%] Built target spv_reflect
[ 18%] Built target meshoptimizer
[ 20%] Built target simdjson
[ 21%] Building CXX object external/madrona/src/common/CMakeFiles/madrona_std_mem.dir/op_new_delete.cpp.o
[ 22%] Built target gtest
[ 29%] Built target nanobind-static
[ 30%] Built target gtest_main
[ 30%] Built target madrona_python_bindings
[ 43%] Built target glfw
[ 48%] Built target imgui_impl
[ 48%] Linking CXX shared library ../../../../libmadrona_std_mem.dylib
[ 48%] Built target madrona_std_mem
[ 52%] Built target madrona_common
[ 52%] Linking CXX shared library ../../../../../libmadrona_render_shader_compiler.dylib
[ 56%] Built target madrona_physics_assets
[ 56%] Built target madrona_bvh_builder
[ 56%] Built target madrona_json
[ 58%] Built target madrona_mw_core
[ 63%] Built target madrona_render_vk
[ 64%] Built target madrona_render_asset_processor
[ 65%] Built target madrona_physics_loader
[ 68%] Built target madrona_mw_cpu
[ 68%] Built target madrona_rendering_system
[ 70%] Built target madrona_render_shader_compiler
[ 74%] Built target madrona_importer
[ 78%] Built target madrona_mw_physics
[ 81%] Built target gpudrive_cpu_impl
[ 84%] Built target madrona_render_core
[ 85%] Built target madrona_render
[ 87%] Built target madrona_window
[ 88%] Built target gpudrive_mgr
[ 88%] Linking CXX shared module ../madrona_gpudrive.cpython-311-darwin.so
[ 88%] Linking CXX executable ../headless
[ 91%] Built target madrona_viz
[ 91%] Linking CXX executable my_tests
[ 92%] Linking CXX executable ../viewer
[ 95%] Built target madrona_gpudrive
[ 95%] Built target headless
[ 96%] Built target viewer
[100%] Built target my_tests
(gpudrive) (base) ➜  build git:(ap_upstreanm) ✗ ./viewer 
OOM: 42

Is this OOM not being triggered in op_new_delete.cpp? That would explain why my breakpoints don't hit. I don't see any other place that prints output in that format.

Perhaps something is wrong with my dependencies?

@aaravpandya

Collaborator Author

@shacklettbp So I tried this on a different Mac (my work laptop) and I am able to run the viewer there. The problem seems to be specific to my local machine. We are going to merge this.
I will probably do a complete clean reset of my laptop and see if that fixes the issue.

Thanks for all the help :)

@daphne-cornelisse changed the base branch from main to dev_kshot April 15, 2025 21:19
@daphne-cornelisse merged commit a53f462 into dev_kshot Apr 15, 2025