Update madrona upstream by aaravpandya · Pull Request #248 · Emerge-Lab/gpudrive

aaravpandya · 2024-09-23T22:04:45Z

No description provided.

aaravpandya · 2025-02-08T15:53:44Z

Hi @shacklettbp,
After upgrading to the latest commit of Madrona, I am unable to run the viewer. Everything else works as expected. When I run the viewer, I get the following error during WindowManager initialization:

(lldb) next
Process 5870 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000100001f48 viewer`main(argc=2, argv=0x000000016fdff348) at viewer.cpp:69:19
   66  	        true;
   67  	#endif
   68  	
-> 69  	    WindowManager wm {};
   70  	    WindowHandle window = wm.makeWindow("GPUDrive", 1920, 1080);
   71  	    render::GPUHandle render_gpu = wm.initGPU(0, { window.get() });
   72  	
Target 0: (viewer) stopped.
(lldb) next
Process 5870 stopped
* thread #2, stop reason = breakpoint 2.3
    frame #0: 0x000000018b890648 Foundation`-[NSThread main]
Foundation`-[NSThread main]:
->  0x18b890648 <+0>:  pacibsp 
    0x18b89064c <+4>:  stp    x29, x30, [sp, #-0x10]!
    0x18b890650 <+8>:  mov    x29, sp
    0x18b890654 <+12>: ldr    x8, [x0, #0x8]
Target 0: (viewer) stopped.
(lldb) continue
Process 5870 resuming
OOM: 42
Process 5870 exited with status = 1 (0x00000001)

I am getting this error on macOS, but I also see it on Linux.

aaravpandya · 2025-02-08T17:03:31Z

Also, I am unable to get the headless binary to run on CUDA.

(base) aarav@emerge2-desktop:~/gpudrive/build$ ./headless CUDA 1
Compiler Flags:
-I/home/aarav/gpudrive/external/madrona/src/mw/device/include
-I/home/aarav/gpudrive/external/madrona/src/common/../../include
-I/usr/local/cuda-12.8/targets/x86_64-linux/include
-std=c++20
-default-device
-rdc=true
-use_fast_math
-DMADRONA_GPU_MODE=1
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP
-DCCCL_DISABLE_BF16_SUPPORT=1
-DCUB_DISABLE_BF16_SUPPORT=1
-arch
sm_89
-DMADRONA_MWGPU_NUM_SMS=(76_i32)
-DMADRONA_MWGPU_MAX_BLOCKS_PER_SM=(1_i32)
-dopt=on
--extra-device-vectorization
-lineinfo
-dlto
-DMADRONA_MWGPU_LTO_MODE=1
-DMADRONA_MWGPU_TASKGRAPH=1

Linker Flags:
-arch=sm_89
-ftz=1
-prec-div=0
-prec-sqrt=0
-fma=1
-optimize-unused-variables
-lineinfo
-lto
-verbose

Compiling GPU engine code:
/home/aarav/gpudrive/external/madrona/src/mw/device/memory.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/state.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/crash.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/consts.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/taskgraph.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/taskgraph_utils.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/sort_archetype.cpp
/home/aarav/gpudrive/external/madrona/src/mw/device/host_print.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../common/hashmap.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../common/navmesh.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../core/base.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/physics.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/geo.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/xpbd.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/tgs.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/narrowphase.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../physics/broadphase.cpp
/home/aarav/gpudrive/external/madrona/src/mw/../render/ecs_system.cpp
/home/aarav/gpudrive/src/sim.cpp
/home/aarav/gpudrive/src/level_gen.cpp
/home/aarav/gpudrive/src/level_gen.cpp(280): warning #177-D: function "gpudrive::createFloorPlane" was declared but never referenced
  static void createFloorPlane(Engine &ctx)
              ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"



CUDA linking Failed!
/home/aarav/gpudrive/src/level_gen.cpp: link error: error: Linking globals named 'initBVHParams': symbol multiply defined!

ERROR 9 in nvvmCompileProgram callback


Error at /home/aarav/gpudrive/external/madrona/src/mw/cuda_exec.cpp:703 in auto madrona::compileCode(const char **, int64_t, const char **, int64_t, const char **, int64_t, const char **, int64_t, const MegakernelConfig *, int64_t, CompileConfig::OptMode, ExecutorMode, bool)::(anonymous class)::operator()(nvJitLinkResult) const
nvJitLink error: NVVM compilation error
Aborted (core dumped)

shacklettbp · 2025-02-11T21:17:00Z

@aaravpandya I checked out your branch and I get a segfault in loadPhysicsObjects when running ./headless CUDA 1. Am I missing some files or is this another bug?

shacklettbp · 2025-02-11T21:17:51Z

Can you also give me the command to run the viewer?

aaravpandya · 2025-02-11T21:22:05Z

It looks like another bug. I have been getting that sporadically too while running on CPU.

To run the viewer, it's simply ./viewer. To run on CUDA, it's ./viewer 1 --cuda. I need to reformat the number-of-worlds parameter here, since we determine the number of worlds from the number of maps provided. The path to an example map is hardcoded.
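A minimal sketch of deriving the world count from the map list rather than from a command-line argument. It assumes the maps are stored as .json files in a single directory; the helpers collectMapPaths and numWorldsFromMaps are hypothetical and are not part of GPUDrive or Madrona.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: gathers every *.json file directly inside map_dir
// (assumed map format) and returns their paths in sorted order.
std::vector<std::string> collectMapPaths(const std::filesystem::path &map_dir)
{
    std::vector<std::string> paths;
    for (const auto &entry : std::filesystem::directory_iterator(map_dir)) {
        if (entry.is_regular_file() &&
            entry.path().extension().string() == ".json") {
            paths.push_back(entry.path().string());
        }
    }
    // directory_iterator order is unspecified, so sort for a stable
    // world index -> map file assignment.
    std::sort(paths.begin(), paths.end());
    return paths;
}

// Hypothetical helper: one world per map, so the world count can never
// disagree with the number of maps that were loaded.
std::size_t numWorldsFromMaps(const std::vector<std::string> &map_paths)
{
    if (map_paths.empty()) {
        throw std::runtime_error("no maps provided; cannot create any worlds");
    }
    return map_paths.size();
}
```

With this shape, a positional world-count argument becomes redundant. Only the map source (and flags such as --cuda) would need to be passed in.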

shacklettbp · 2025-02-11T21:46:18Z

Ok, I think I fixed the loadPhysicsObjects bug, it was a Madrona issue.

By the way for future reference:

    // Define the texture paths
    const char *texture_paths[] = {
        (std::filesystem::path(DATA_DIR) / "green_grid.png").string().c_str(),
        (std::filesystem::path(DATA_DIR) / "smile.png").string().c_str()
    };

This is unsafe because each std::string (the return value of .string()) is a temporary that is destroyed as soon as this statement finishes, leaving the const char * pointers dangling. Unfortunately, you need to keep the std::strings alive in an array of their own and then build a separate array of const char * that points into it (I'll push this fix).
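A minimal sketch of the pattern described above, assuming C++17 and the same two texture files. The TexturePaths struct is hypothetical (it is not the actual fix that was pushed), and the DATA_DIR fallback exists only so the snippet compiles outside the GPUDrive build, which normally supplies DATA_DIR itself.

```cpp
#include <array>
#include <cassert>
#include <cstring>
#include <filesystem>
#include <string>

// Placeholder so the sketch compiles standalone; the real build defines DATA_DIR.
#ifndef DATA_DIR
#define DATA_DIR "data"
#endif

// Hypothetical holder: `storage` owns the std::strings, and `c_strs` holds
// const char * views into them. The views stay valid for as long as this
// object is alive and `storage` is not modified.
struct TexturePaths {
    std::array<std::string, 2> storage;  // declared first, so initialized first
    std::array<const char *, 2> c_strs;

    TexturePaths()
        : storage {{
              (std::filesystem::path(DATA_DIR) / "green_grid.png").string(),
              (std::filesystem::path(DATA_DIR) / "smile.png").string(),
          }},
          c_strs {{ storage[0].c_str(), storage[1].c_str() }}
    {}

    // A copy's c_strs would still point into the original object's strings,
    // which dangle once the original is destroyed, so copying is disabled.
    TexturePaths(const TexturePaths &) = delete;
    TexturePaths &operator=(const TexturePaths &) = delete;
};
```

Usage: construct `TexturePaths paths;` in the scope that performs the texture loading and pass `paths.c_strs.data()` wherever the old `texture_paths` array was used. The pointers remain valid until `paths` goes out of scope.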

shacklettbp · 2025-02-12T22:00:20Z

@aaravpandya After my latest commit this branch works for me now. Let me know if you run into any problems.

The one thing I notice is that when simulating multiple worlds, only the first world shows anything in the viewer, on both CPU and GPU backends. Has the viewer always behaved this way in GPUDrive? This isn't the case for our other environments.

aaravpandya · 2025-02-12T23:27:56Z

Thank you so much, @shacklettbp. I see that it now works on both CUDA and CPU. I was not able to fully test the viewer on Ubuntu, but @daphne-cornelisse ran it and it got past the earlier OOM error.

However, on my Mac I am still getting the OOM error. I don't think I am actually out of memory, because I have 48 GB of RAM.

Regarding the viewer, I believe there is a drop-down from which we can select the world we want to view. I was actually going to ask whether you could provide us with some release notes or documentation on the new changes and features in Madrona that we can use. For example, I see that Madrona has a new batch renderer and some new named tensor interfaces, and I would like to know how we can use them.

shacklettbp · 2025-02-13T01:41:09Z

What is the full backtrace for the error you're getting on macos? It works on my macbook.

Right, when I select anything other than world 0 in that drop down in the viewer, the world is blank (no entities).

Unfortunately, there are no release notes and only minimal documentation. We added the batch renderer as part of a SIGGRAPH Asia paper: https://madrona-engine.github.io/renderer.html. If you're interested in using it, it would probably make sense to set up a Zoom call with everyone (me, anyone on your side, and Luc, the first author of the batch renderer paper).

The named tensor thing is a TensorDict-like interface; I'm currently using it for JAX interop. I don't think it's useful for you in its current state. Are there any features you need in that space?

aaravpandya · 2025-02-13T16:24:18Z

I get the same backtrace as before. I made sure I am on the latest commit and that all submodules are updated. It fails when creating the WindowManager object.

(lldb) 
Process 98495 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000100001f48 viewer`main(argc=2, argv=0x000000016fdff348) at viewer.cpp:69:19
   66  	        true;
   67  	#endif
   68  	
-> 69  	    WindowManager wm {};
   70  	    WindowHandle window = wm.makeWindow("GPUDrive", 1920, 1080);
   71  	    render::GPUHandle render_gpu = wm.initGPU(0, { window.get() });
   72  	
Target 0: (viewer) stopped.
(lldb) next
Process 98495 stopped
* thread #2, stop reason = breakpoint 1.11
    frame #0: 0x000000018512c648 Foundation`-[NSThread main]
Foundation`-[NSThread main]:
->  0x18512c648 <+0>:  pacibsp 
    0x18512c64c <+4>:  stp    x29, x30, [sp, #-0x10]!
    0x18512c650 <+8>:  mov    x29, sp
    0x18512c654 <+12>: ldr    x8, [x0, #0x8]
Target 0: (viewer) stopped.
(lldb) continue
Process 98495 resuming
OOM: 42
Process 98495 exited with status = 1 (0x00000001)

Could I be on an incompatible Xcode or macOS version? I am on Xcode 16.2 and macOS 15.0 Sequoia.

shacklettbp · 2025-02-13T19:03:51Z

I see. Can you run it again under the debugger with the latest commit and submodule update, and send me the debugger backtrace? You should be able to get a proper backtrace now.

aaravpandya · 2025-04-12T17:25:44Z

Hi @shacklettbp, sorry, I got distracted and didn't work on this PR any further.
I gave it another try and still don't see a backtrace:

(lldb) br s -f /Users/aaravpandya/dev/gpudrive/external/madrona/src/common/op_new_delete.cpp -l 37
Breakpoint 2: where = libmadrona_std_mem.dylib`madrona::(anonymous namespace)::opNewAlignImpl(unsigned long, std::align_val_t) + 96 at op_new_delete.cpp:37:17, address = 0x00000001006f0f30
(lldb) run
Process 92194 launched: '/Users/aaravpandya/dev/gpudrive/build/viewer' (arm64)
OOM: 42
Process 92194 exited with status = 1 (0x00000001) 
(lldb) br s -f /Users/aaravpandya/dev/gpudrive/external/madrona/src/common/op_new_delete.cpp -l 14
Breakpoint 3: where = libmadrona_std_mem.dylib`madrona::(anonymous namespace)::opNewImpl(unsigned long) + 52 at op_new_delete.cpp:14:17, address = 0x00000001006f0d38
(lldb) run
Process 92198 launched: '/Users/aaravpandya/dev/gpudrive/build/viewer' (arm64)
OOM: 42
Process 92198 exited with status = 1 (0x00000001) 
(lldb) bt
error: Command requires a process which is currently stopped.

Also, I see you have removed the sorting code from the taskgraph. Do we no longer need to sort archetypes manually? Is that handled inside Madrona now?

Thanks

aaravpandya · 2025-04-12T17:44:33Z

This is very confusing! I removed the fprintf for the OOM message, and it still prints it. When I recompile, I can see that the change is being rebuilt:

(gpudrive) (base) ➜  build git:(ap_upstreanm) ✗ make -j12
[  0%] Built target madrona_vk_loader
[  1%] Built target madrona_moltenvk_lib
[  1%] Built target madrona_embree_lib
[  1%] Built target madrona_libdxc_shlib
[  2%] Built target generate_vk_dispatch
[  5%] Built target madrona_python_utils
[  5%] Built target madrona_mem
[  7%] Built target madrona_err
[  9%] Built target spv_reflect
[ 18%] Built target meshoptimizer
[ 20%] Built target simdjson
[ 21%] Building CXX object external/madrona/src/common/CMakeFiles/madrona_std_mem.dir/op_new_delete.cpp.o
[ 22%] Built target gtest
[ 29%] Built target nanobind-static
[ 30%] Built target gtest_main
[ 30%] Built target madrona_python_bindings
[ 43%] Built target glfw
[ 48%] Built target imgui_impl
[ 48%] Linking CXX shared library ../../../../libmadrona_std_mem.dylib
[ 48%] Built target madrona_std_mem
[ 52%] Built target madrona_common
[ 52%] Linking CXX shared library ../../../../../libmadrona_render_shader_compiler.dylib
[ 56%] Built target madrona_physics_assets
[ 56%] Built target madrona_bvh_builder
[ 56%] Built target madrona_json
[ 58%] Built target madrona_mw_core
[ 63%] Built target madrona_render_vk
[ 64%] Built target madrona_render_asset_processor
[ 65%] Built target madrona_physics_loader
[ 68%] Built target madrona_mw_cpu
[ 68%] Built target madrona_rendering_system
[ 70%] Built target madrona_render_shader_compiler
[ 74%] Built target madrona_importer
[ 78%] Built target madrona_mw_physics
[ 81%] Built target gpudrive_cpu_impl
[ 84%] Built target madrona_render_core
[ 85%] Built target madrona_render
[ 87%] Built target madrona_window
[ 88%] Built target gpudrive_mgr
[ 88%] Linking CXX shared module ../madrona_gpudrive.cpython-311-darwin.so
[ 88%] Linking CXX executable ../headless
[ 91%] Built target madrona_viz
[ 91%] Linking CXX executable my_tests
[ 92%] Linking CXX executable ../viewer
[ 95%] Built target madrona_gpudrive
[ 95%] Built target headless
[ 96%] Built target viewer
[100%] Built target my_tests
(gpudrive) (base) ➜  build git:(ap_upstreanm) ✗ ./viewer 
OOM: 42

Is this OOM message not coming from op_new_delete.cpp? That would explain why my breakpoints don't hit. I don't see anywhere else that prints in that format.

Perhaps something is wrong with my dependencies?

aaravpandya · 2025-04-12T17:53:52Z

@shacklettbp I tried this on a different Mac (my work laptop), and I am able to run the viewer there. The problem is specific to my local machine, so we are going to merge this.
I will probably do a complete clean reset of my laptop and see if that fixes the issue.

Thanks for all the help :)

aaravpandya added 6 commits September 23, 2024 18:04

Update madrona upstream

efb09b7

Merge branch 'main' into ap_upstreanm

e484027

Update to newer madrona version

ad17b58

Free images after texture loading

e4ecad5

Incorrect api call

2a51463

Update madrona

022b435

aaravpandya requested review from daphne-cornelisse, eugenevinitsky and shacklettbp February 8, 2025 15:43

Replace max and min

8b4de78

shacklettbp added 2 commits February 11, 2025 15:58

Updates for latest madrona

1a13c83

Update madrona

392f3d0

Update madrona

57dd925

shacklettbp and others added 4 commits February 13, 2025 12:18

Update madrona

3f7152a

Fix road type encoding (#375)

b925937

new agent init order (#408)

9275c5b

Merge branch 'main' into ap_upstreanm

049212b

Update madrona

7d1a5f9

daphne-cornelisse changed the base branch from main to dev_kshot April 15, 2025 21:19

daphne-cornelisse force-pushed the dev_kshot branch from 8ada1fe to edbaedd Compare April 15, 2025 21:21

daphne-cornelisse merged commit a53f462 into dev_kshot Apr 15, 2025
