HelloNappy is a Clef NPU program targeting the AMD XDNA2 NPU on Strix Halo. The host generates a triangular wave and a square wave, stages them into NPU buffers, dispatches an element-wise multiply across 4 AIE tiles, reads the result back, and prints all three signals as numeric output.
The host program reaches the NPU through XRT's modern hw_context dispatch path, which
is exposed only via C++ classes. The bindings are hand-rolled [<FidelityExtern>]
declarations against Itanium-mangled C++ symbols in libxrt_coreutil.so, with explicit
pimpl storage on the Clef side. There is no IPC, no C shim, no dlopen of a glue
library — Clef calls C++ constructors and destructors directly through the same
ExternCall path it uses for libc.
HelloNappy is structured as two Fidelity projects with distinct compilation targets:

```
.
├── HelloNappy.fidproj              # Host build manifest (target = "cpu", output_kind = "console")
├── src/
│   ├── XrtCppBindings.clef         # [<FidelityExtern>] bindings to xrt:: classes (C++ API)
│   ├── XrtBindings.clef            # [<FidelityExtern>] bindings to xrt* (C API; reference only)
│   └── Program.clef                # [<EntryPoint>] host dispatcher
├── kernel/
│   ├── HelloNappyKernel.fidproj    # Kernel build manifest (target = "npu", output_kind = "kernel")
│   ├── src/
│   │   └── Kernel.clef             # [<KernelModule>] ElementKernel<int32>
│   └── targets/                    # HelloNappyKernel.xclbin + HelloNappyKernel_insts.bin
└── targets/                        # HelloNappy executable + xclbin/instrs copied alongside
```

The two projects communicate through two files at runtime: HelloNappyKernel.xclbin
(the AIE bitstream + metadata) and HelloNappyKernel_insts.bin (the DPU instruction
stream). Both must be in the host's working directory when the executable runs.
The kernel is a pure binary function lifted to a spatial design:

```
type ElementKernel<'T> = {
    Compute: 'T -> 'T -> 'T
    Shape: { Elements: int; Grain: int }
}

let multiply (a: int32) (b: int32) : int32 = a * b

[<KernelModule>]
let emul : ElementKernel<int32> = {
    Compute = multiply
    Shape = { Elements = 64; Grain = 16 }
}
```

`emul` is the entire algorithmic content of the kernel. The compiler derives
the tile count from `Elements / Grain` (64 / 16 = 4 AIE tiles) and synthesizes
the spatial coordination (per-tile DMA, FIFO routing, and instruction
sequencing) through the Composer AIE backend.
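
As a rough host-side model of what that derivation means, the sketch below splits the element range into one chunk per tile and applies the binary function chunk by chunk. It assumes contiguous, equal-sized chunks; the README does not state how the compiler actually assigns elements to tiles, and `partition` and `run_tiled` are hypothetical names used only for illustration.

```python
def partition(elements: int, grain: int) -> list[range]:
    """Split `elements` indices into contiguous chunks of `grain`, one per tile.

    Assumption: tiles receive contiguous slices; the real mapping is up to
    the compiler backend.
    """
    if grain <= 0 or elements % grain != 0:
        raise ValueError("grain must be positive and divide elements evenly")
    return [range(t * grain, (t + 1) * grain) for t in range(elements // grain)]


def run_tiled(compute, a, b, grain):
    """Apply `compute` element-wise, visiting the data one tile-sized chunk at a time."""
    out = [0] * len(a)
    for tile in partition(len(a), grain):
        for i in tile:
            out[i] = compute(a[i], b[i])
    return out


tiles = partition(64, 16)  # Shape = { Elements = 64; Grain = 16 }
```

With the shape used by `emul`, `tiles` holds four ranges of 16 indices each, matching the 4-tile layout described above.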
This mirrors the FPGA contract used by HelloArty:

| Target | Application supplies | Compiler synthesizes |
|---|---|---|
| FPGA | `Design<'S> = { InitialState; Step }` | flip-flops, comb logic, PWM, pin assignment |
| NPU | `ElementKernel<'T> = { Compute; Shape }` | AIE tile layout, DMA, FIFOs, DPU instructions |

In both cases the user writes idiomatic ML; the hardware-specific lowering happens in the BackEnd.
XRT's NPU dispatch path requires six C++ classes, all of which follow the pimpl
idiom with a single std::shared_ptr<impl> member:

| Class | Size (bytes) | Layout |
|---|---|---|
| `xrt::device` | 16 | `shared_ptr<device_impl>` |
| `xrt::xclbin` | 16 | `shared_ptr<xclbin_impl>` via `detail::pimpl` |
| `xrt::hw_context` | 16 | `shared_ptr<hw_context_impl>` |
| `xrt::kernel` | 16 | `shared_ptr<kernel_impl>` |
| `xrt::bo` | 16 | `shared_ptr<bo_impl>` |
| `xrt::run` | 16 | `shared_ptr<run_impl>` |

The host allocates 16 bytes of stack storage per object via NativePtr.stackalloc,
zero-initializes it (a zeroed region is equivalent to a null shared_ptr), and
calls the Itanium C1 constructor through a [<FidelityExtern>] whose target is
the mangled symbol:

```
[<FidelityExtern("xrt_coreutil", "_ZN3xrt6deviceC1Ej")>]
let deviceConstruct (this: nativeint) (deviceIndex: uint) : unit =
    NativeDefault.zeroed ()
```

Subsequent methods take `const&` parameters, which at ABI level are pointers to
the 16-byte storage. Destruction is the matching D1 destructor invoked in reverse
construction order. The full ABI inventory lives in `src/XrtCppBindings.clef`,
including the SysV sret convention for `register_xclbin` (which returns
`xrt::uuid` by value) and the GCC `__cxx11` SSO layout for the `std::string`
passed to `xrt::kernel`'s constructor.
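
To make the symbol names less opaque, here is a minimal sketch of how the Itanium ABI spells a C1 constructor or D1 destructor of a class nested in namespaces. It only covers this simple case (no templates, no substitutions, built-in parameter codes passed in verbatim), and `mangle_special` is a hypothetical helper, not part of any toolchain.

```python
def mangle_special(scopes, kind, params="v"):
    """Mangle a complete-object constructor ("C1") or destructor ("D1").

    scopes: namespace and class names, outermost first, e.g. ["xrt", "device"].
    params: already-encoded parameter types; "j" is unsigned int and "v"
            stands for an empty parameter list.
    """
    if kind not in ("C1", "D1"):
        raise ValueError("only C1 and D1 are handled in this sketch")
    # Each component of the nested name is written as <length><identifier>.
    nested = "".join(f"{len(name)}{name}" for name in scopes)
    return f"_ZN{nested}{kind}E{params}"


ctor = mangle_special(["xrt", "device"], "C1", "j")
dtor = mangle_special(["xrt", "device"], "D1")
```

Here `ctor` reproduces the `_ZN3xrt6deviceC1Ej` target used in the binding above, and `dtor` is the corresponding D1 name, `_ZN3xrt6deviceD1Ev`.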
xrt::xclbin::~xclbin() is compiler-synthesized inline as the destructor for
the embedded shared_ptr<xclbin_impl> and is not exported from
libxrt_coreutil.so's .dynsym. The current bindings leak the xclbin object
on process exit; for a short-lived dispatch program this is acceptable. Phase 2
(Farscape OnClass) will generate the shared_ptr teardown sequence directly,
rather than calling through a missing symbol. See
docs/cpp-pimpl-lifetime-design.md for the
full analysis.
The host walks the XRT hw_context path step by step. Each step is a direct
func.call to a [<FidelityExtern>]-bound C++ entry point; no glue layer
intervenes.

```
xrt::device(0)
  -> mmap("HelloNappyKernel.xclbin")
  -> xrt::xclbin(const axlf*)              // buffer ctor, not string_view (XRT 2.21 bug)
  -> device.register_xclbin(xclbin)        // returns xrt::uuid via hidden sret
  -> xrtXclbinGetUUID                      // C API, sidesteps struct-return ABI
  -> xrt::hw_context(device, uuid, shared=1)
  -> xrt::kernel(hw_context, "MLIR_AIE")
  -> xrt::bo for instructions              (XCL_BO_FLAGS_CACHEABLE = 0x1000000)
  -> xrt::bo for A, B, out                 (XRT_BO_FLAGS_HOST_ONLY = 0x2000000)
  -> map + sync(host->device)
  -> xrt::run(kernel)
       arg 0:    opcode = 0 (uint64 scalar)
       arg 1:    instr  = BO
       arg 2:    ninstr = uint32 (instr_bytes / 4)
       arg 3..5: A, B, out (BO)
  -> run.start() / run.wait(30s)
  -> sync(device->host) + read back
  -> destruct in reverse construction order
```

A few of these steps exist for ABI reasons rather than convenience:

- xclbin from `const axlf*`, not `string_view`. The `string_view` ctor throws `std::bad_alloc` on XRT 2.21.0 (confirmed via C test harness in targets/). The host opens the file via `Fidelity.Libc.IO`, `mmap`s it, and hands the buffer to the `const axlf*` ctor instead. File I/O stays in Clef's control; no C++ exceptions are raised for missing files.
- UUID extraction via the C API. `device.register_xclbin` returns `xrt::uuid` by value. Under SysV x86_64, `xrt::uuid` is classified MEMORY (non-trivial ctor), which means the callee expects a hidden sret pointer shifted into `rdi`. The binding handles that, but for the subsequent `xclbin::get_uuid()` lookup the C API (`xrtXclbinGetUUID`) is simpler: it writes 16 bytes into a caller buffer with a normal calling convention.
- `std::string` built by hand. `xrt::kernel`'s constructor takes `const std::string&` (GCC `__cxx11`). The host stack-allocates 32 bytes, writes the SSO layout for `"MLIR_AIE"` directly (data pointer at offset 0, length `8` at offset 8, the 8 chars + null at offset 16), and passes the address.

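
The hand-built `std::string` from the last bullet can be modelled outside Clef. The sketch below lays out the same 32 bytes in a `ctypes` buffer, assuming a little-endian 64-bit target and the libstdc++ `__cxx11` small-string layout described above; `build_sso_string` is a hypothetical name, and the real host does this with stack storage rather than a Python buffer.

```python
import ctypes
import struct


def build_sso_string(text: bytes):
    """Lay out a short string in 32 bytes using the small-string (SSO) form.

    Offsets follow the bullet above: data pointer at 0, length at 8,
    characters plus NUL terminator in the inline buffer at 16.
    """
    if len(text) > 15:
        raise ValueError("the inline buffer holds at most 15 bytes plus NUL")
    storage = ctypes.create_string_buffer(32)  # zero-filled, so the NUL is already there
    base = ctypes.addressof(storage)
    # The data pointer refers to the object's own inline buffer at offset 16.
    struct.pack_into(f"<QQ{len(text)}s", storage, 0, base + 16, len(text), text)
    return storage


name = build_sso_string(b"MLIR_AIE")
data_ptr, length = struct.unpack_from("<QQ", name.raw, 0)
```

Because the data pointer is self-referential, the 32 bytes are only valid at the address they were built for; copying them elsewhere would leave the pointer dangling, which is why the host writes them in place and passes that address.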
Today the host calls destructors explicitly, in a series of `if *Alive then *Destruct …` blocks at the end of `main`. This is Phase 1: hand-placed
lifecycle, written once and reviewed against the construction order.
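
The shape of that hand-placed cleanup can be shown with a small analogy: only objects that were actually constructed get torn down, and teardown runs in reverse construction order. The sketch uses Python's `contextlib.ExitStack` as a stand-in for the alive flags; the handle names are illustrative and `run_dispatch` is hypothetical, not the host's code.

```python
from contextlib import ExitStack


def run_dispatch(fail_at=None):
    """Construct stand-in handles in order, then destruct them in reverse.

    xclbin is left out on purpose: its destructor is not exported, so the
    bindings do not call it.
    """
    log = []
    names = ["device", "hw_context", "kernel", "bo", "run"]
    with ExitStack() as stack:
        for name in names:
            if name == fail_at:
                log.append(f"FAILED: {name}")
                break  # leaving the block still unwinds what was built so far
            log.append(f"construct {name}")
            # Registered callbacks run last-in, first-out when the block exits.
            stack.callback(log.append, f"destruct {name}")
    return log


happy_path = run_dispatch()
partial = run_dispatch(fail_at="kernel")
```

In `partial`, only `device` and `hw_context` were constructed, so only those two are destructed, `hw_context` first.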
Phase 2 moves the placement into the compiler. The PSG's escape analysis
already classifies allocating sites into StackScoped, EscapesViaReturn,
EscapesViaClosure, and EscapesViaByRef. A new PimplLifecycle coeffect
and a pimpl-lifecycle-lowering MLIR plugin will let the witness emit the
destructor func.call at the correct scope boundary automatically. The
design analysis is in:

- docs/cpp-pimpl-lifetime-design.md: ABI verification, coeffect carrier design, ExternCall reuse, byref safety
- docs/psg-destructor-lifetime-inference.md: how escape kinds map to destructor placement
- docs/farscape-pilot-cpp-navigation.md: classifying mixed C/C++ driver libraries and routing `OnClass` through the Pilot
- docs/plan-farscape-sret-detection.md: detecting the SysV sret convention for value-returning C++ methods

Once Phase 2 lands, src/XrtCppBindings.clef and the manual cleanup block in src/Program.clef become a regression test for whatever Farscape generates.

```
Clef source (host)                          Clef source (kernel)
  -> CCS front-end                            -> CCS front-end
  -> Composer (CPU lowering)                  -> Composer (AIE backend)
  -> standard MLIR -> LLVM IR                 -> aie_prj / MLIR_AIE
  -> clang -> ELF + libxrt_coreutil link      -> Vitis / Peano -> .xclbin + insts.bin
```


```
# Build the NPU kernel (produces HelloNappyKernel.xclbin + HelloNappyKernel_insts.bin)
cd kernel
/path/to/Composer compile HelloNappyKernel.fidproj

# Build the host
cd ..
/path/to/Composer compile HelloNappy.fidproj

# Stage both kernel artifacts next to the host executable, then run
cp kernel/targets/HelloNappyKernel.xclbin targets/
cp kernel/targets/HelloNappyKernel_insts.bin targets/
cd targets
./HelloNappy
```

`libxrt_coreutil.so` is loaded at runtime via `dlopen`/`dlsym`. `FidelityExtern`
bindings with `library != "c"` no longer need `-l` flags, so the host
fidproj has no `[link]` section.

- Inputs (host-generated, 64 int32 samples each):
  - Signal A: triangular wave, period 32, peak ±100
  - Signal B: square wave, period 16, amplitude ±1
- Output: `A[i] * B[i]` computed on the NPU across 4 AIE tiles (16 elements per tile).
- Display: all three signals printed as `[i] = value` lines on stdout. Exit code `0` on success, `1` on any failure with a `FAILED: …` diagnostic.

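
For checking the NPU output by hand, a CPU reference along these lines may be useful. The host's exact waveform formulas are not shown here, so the phase origin and integer rounding below are assumptions chosen only to satisfy the stated period and amplitude; `triangle` and `square` are hypothetical helpers.

```python
N = 64  # samples per signal, as stated above


def triangle(i, period=32, peak=100):
    """Integer triangular wave: -peak at phase 0, +peak at half period (assumed phase)."""
    half = period // 2
    phase = i % period
    rising = phase if phase < half else period - phase
    return -peak + (2 * peak * rising) // half


def square(i, period=16):
    """Square wave: +1 for the first half of each period, -1 for the second (assumed phase)."""
    return 1 if (i % period) < period // 2 else -1


A = [triangle(i) for i in range(N)]
B = [square(i) for i in range(N)]
OUT = [a * b for a, b in zip(A, B)]  # what the element-wise multiply should return
```

Comparing `OUT` against the values read back from the output buffer gives a quick pass/fail check, provided the host generates its inputs with the same phase and rounding.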

- AMD Strix Halo with XDNA2 NPU
- Linux x86_64 with the `amdxdna` driver loaded
- XRT 2.21.0 (`libxrt_coreutil.so` discoverable via `LD_LIBRARY_PATH` or `ldconfig`)
- An xclbin produced by the Composer AIE backend; reference build artifacts live in kernel/targets/ and targets/.