HelloNappy

HelloNappy is a Clef NPU program targeting the AMD XDNA2 NPU on Strix Halo. The host generates a triangular wave and a square wave, stages them into NPU buffers, dispatches an element-wise multiply across 4 AIE tiles, reads the result back, and prints all three signals as numeric output.

The host program reaches the NPU through XRT's modern hw_context dispatch path, which is exposed only via C++ classes. The bindings are hand-rolled [<FidelityExtern>] declarations against Itanium-mangled C++ symbols in libxrt_coreutil.so, with explicit pimpl storage on the Clef side. There is no IPC, no C shim, no dlopen of a glue library — Clef calls C++ constructors and destructors directly through the same ExternCall path it uses for libc.

Architecture

HelloNappy is structured as two Fidelity projects with distinct compilation targets:

.
├── HelloNappy.fidproj             # Host build manifest  (target = "cpu",   output_kind = "console")
├── src/
│   ├── XrtCppBindings.clef        # [<FidelityExtern>] bindings to xrt:: classes (C++ API)
│   ├── XrtBindings.clef           # [<FidelityExtern>] bindings to xrt* (C API; reference only)
│   └── Program.clef               # [<EntryPoint>] host dispatcher
├── kernel/
│   ├── HelloNappyKernel.fidproj   # Kernel build manifest (target = "npu", output_kind = "kernel")
│   ├── src/
│   │   └── Kernel.clef            # [<KernelModule>] ElementKernel<int32>
│   └── targets/                   # HelloNappyKernel.xclbin + HelloNappyKernel_insts.bin
└── targets/                       # HelloNappy executable + xclbin/instrs copied alongside

The two projects communicate through two files at runtime: HelloNappyKernel.xclbin (the AIE bitstream + metadata) and HelloNappyKernel_insts.bin (the DPU instruction stream). Both must be in the host's working directory when the executable runs.

NPU Design Model

The kernel is a pure binary function lifted to a spatial design:

type ElementKernel<'T> = {
    Compute: 'T -> 'T -> 'T
    Shape: { Elements: int; Grain: int }
}

let multiply (a: int32) (b: int32) : int32 = a * b

[<KernelModule>]
let emul : ElementKernel<int32> = {
    Compute = multiply
    Shape = { Elements = 64; Grain = 16 }
}

emul is the entire algorithmic content of the kernel. The compiler derives tile count from Elements / Grain (64 / 16 = 4 AIE tiles) and synthesizes the spatial coordination — per-tile DMA, FIFO routing, and instruction sequencing — through the Composer AIE backend.
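The tile-count arithmetic can be illustrated with a short Python sketch. It is not part of the Composer backend: `tile_ranges` is a hypothetical helper, and the contiguous assignment of elements to tiles is an assumption made only for illustration.

```python
def tile_ranges(elements: int, grain: int) -> list:
    """Split `elements` into contiguous per-tile index ranges of `grain` elements.

    Assumption: tiles receive contiguous slices; the real backend may
    distribute elements differently.
    """
    if grain <= 0 or elements <= 0 or elements % grain != 0:
        raise ValueError("elements must be a positive multiple of grain")
    return [range(start, start + grain) for start in range(0, elements, grain)]


if __name__ == "__main__":
    # Shape = { Elements = 64; Grain = 16 } gives 64 / 16 = 4 tiles.
    for tile, indices in enumerate(tile_ranges(64, 16)):
        print(f"tile {tile}: elements {indices.start}..{indices.stop - 1}")
```

Rejecting shapes where Elements is not a multiple of Grain keeps the sketch honest about what the 64 / 16 example actually demonstrates; how the compiler treats a remainder is not stated here.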

This mirrors the FPGA contract used by HelloArty:

| Target | Application supplies | Compiler synthesizes |
|--------|----------------------|----------------------|
| FPGA | `Design<'S> = { InitialState; Step }` | flip-flops, comb logic, PWM, pin assignment |
| NPU | `ElementKernel<'T> = { Compute; Shape }` | AIE tile layout, DMA, FIFOs, DPU instructions |

In both cases the user writes idiomatic ML; the hardware-specific lowering happens in the BackEnd.

XRT C++ Binding Surface

XRT's NPU dispatch path requires six C++ classes, all of which follow the pimpl idiom with a single std::shared_ptr<impl> member:

| Class | Size (bytes) | Layout |
|-------|--------------|--------|
| `xrt::device` | 16 | `shared_ptr<device_impl>` |
| `xrt::xclbin` | 16 | `shared_ptr<xclbin_impl>` via `detail::pimpl` |
| `xrt::hw_context` | 16 | `shared_ptr<hw_context_impl>` |
| `xrt::kernel` | 16 | `shared_ptr<kernel_impl>` |
| `xrt::bo` | 16 | `shared_ptr<bo_impl>` |
| `xrt::run` | 16 | `shared_ptr<run_impl>` |

The host allocates 16 bytes of stack storage per object via NativePtr.stackalloc, zero-initializes it (a zeroed region is equivalent to a null shared_ptr), and calls the Itanium C1 constructor through a [<FidelityExtern>] whose target is the mangled symbol:

[<FidelityExtern("xrt_coreutil", "_ZN3xrt6deviceC1Ej")>]
let deviceConstruct (this: nativeint) (deviceIndex: uint) : unit =
    NativeDefault.zeroed ()

Subsequent methods take const& parameters, which at ABI level are pointers to the 16-byte storage. Destruction is the matching D1 destructor invoked in reverse construction order. The full ABI inventory — including the SysV sret convention for register_xclbin (which returns xrt::uuid by value) and the GCC __cxx11 SSO layout for the std::string passed to xrt::kernel's constructor — lives in src/XrtCppBindings.clef.
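To make the C1/D1 symbol names less opaque, here is a minimal Python sketch of how the Itanium mangling for a constructor or destructor of a namespaced class is assembled. `mangle_special` is a hypothetical helper, not part of the bindings. It covers only the simple case of plain nested names with pre-encoded parameter types, with no templates or substitutions.

```python
def mangle_special(scope, kind, params):
    """Build an Itanium-mangled name for a constructor or destructor.

    scope  -- nested name components, e.g. ["xrt", "device"]
    kind   -- "C1" (complete-object constructor) or "D1" (complete-object destructor)
    params -- already-encoded parameter types, e.g. "j" (unsigned int) or "v" (none)
    """
    if kind not in ("C1", "D1"):
        raise ValueError("only C1 and D1 are handled in this sketch")
    # Each component is emitted as <length><identifier>, wrapped in N ... E.
    nested = "".join(f"{len(part)}{part}" for part in scope)
    return f"_ZN{nested}{kind}E{params}"


if __name__ == "__main__":
    # xrt::device::device(unsigned int)
    print(mangle_special(["xrt", "device"], "C1", "j"))
    # xrt::device::~device()
    print(mangle_special(["xrt", "device"], "D1", "v"))
```

The first call reproduces the `_ZN3xrt6deviceC1Ej` target used by deviceConstruct above. Constructors that take class-typed or reference parameters need the fuller encoding rules, so their names are best read from the library's symbol table rather than derived by hand.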

One known gap

xrt::xclbin::~xclbin() is compiler-synthesized inline (it only destroys the embedded shared_ptr<xclbin_impl>), so no symbol for it appears in libxrt_coreutil.so's .dynsym. The current bindings therefore never destroy the xclbin object and leave it to be reclaimed at process exit; for a short-lived dispatch program this is acceptable. Phase 2 (Farscape OnClass) will generate the shared_ptr teardown sequence directly, rather than calling through a missing symbol. See docs/cpp-pimpl-lifetime-design.md for the full analysis.

Dispatch Pipeline

The host walks the XRT hw_context path step by step. Each step is a direct func.call to a [<FidelityExtern>]-bound C++ entry point; no glue layer intervenes.

xrt::device(0)
  -> mmap("HelloNappyKernel.xclbin")
  -> xrt::xclbin(const axlf*)          // buffer ctor, not string_view (XRT 2.21 bug)
  -> device.register_xclbin(xclbin)    // returns xrt::uuid via hidden sret
  -> xrtXclbinGetUUID                  // C API, sidesteps struct-return ABI
  -> xrt::hw_context(device, uuid, shared=1)
  -> xrt::kernel(hw_context, "MLIR_AIE")
  -> xrt::bo for instructions   (XCL_BO_FLAGS_CACHEABLE = 0x1000000)
  -> xrt::bo for A, B, out      (XRT_BO_FLAGS_HOST_ONLY  = 0x2000000)
  -> map + sync(host->device)
  -> xrt::run(kernel)
       arg 0: opcode  = 0       (uint64 scalar)
       arg 1: instr   = BO
       arg 2: ninstr  = uint32  (instr_bytes / 4)
       arg 3..5: A, B, out      (BO)
  -> run.start() / run.wait(30s)
  -> sync(device->host) + read back
  -> destruct in reverse construction order
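The `ninstr = instr_bytes / 4` argument in the pipeline above treats the instruction file as a flat array of 32-bit words. The following Python sketch shows that arithmetic. `instruction_words` is a hypothetical helper, and the little-endian word order is an assumption based on the x86_64 host rather than something stated about the file format.

```python
import struct


def instruction_words(blob: bytes) -> list:
    """Interpret an instruction blob as 32-bit words and return them as a list.

    Assumption: words are little-endian, matching the x86_64 host.
    """
    if len(blob) % 4 != 0:
        raise ValueError("instruction stream length must be a multiple of 4 bytes")
    count = len(blob) // 4  # this is the ninstr value passed as arg 2
    return list(struct.unpack(f"<{count}I", blob))


if __name__ == "__main__":
    # Stand-in bytes for illustration; a real run would read HelloNappyKernel_insts.bin.
    sample = struct.pack("<3I", 1, 2, 0xDEADBEEF)
    words = instruction_words(sample)
    print(f"instr_bytes = {len(sample)}, ninstr = {len(words)}")
```

Checking the length before dividing catches a truncated file early, instead of passing a rounded-down count to the kernel.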

A few of these steps exist for ABI reasons rather than convenience:

- xclbin from const axlf*, not string_view. The string_view ctor throws std::bad_alloc on XRT 2.21.0 (confirmed via C test harness in targets/). The host opens the file via Fidelity.Libc.IO, mmaps it, and hands the buffer to the const axlf* ctor instead. File I/O stays in Clef's control; no C++ exceptions are raised for missing files.
- UUID extraction via the C API. device.register_xclbin returns xrt::uuid by value. Under the SysV x86_64 ABI, xrt::uuid is classified MEMORY (its special member functions make it non-trivial for the purposes of calls), so the callee expects a hidden sret pointer in rdi, which shifts this and the remaining arguments one register over. The binding handles that, but for the subsequent xclbin::get_uuid() lookup the C API (xrtXclbinGetUUID) is simpler: it writes 16 bytes into a caller buffer with a normal calling convention.
- std::string built by hand. xrt::kernel's constructor takes const std::string& (GCC __cxx11). The host stack-allocates 32 bytes and writes the SSO layout for "MLIR_AIE" directly: the data pointer at offset 0 (pointing at the inline buffer at offset 16), the length 8 at offset 8, and the 8 chars + null at offset 16. It then passes the address of that storage.
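The hand-built std::string layout can be reproduced outside Clef for inspection. The Python sketch below fills a 32-byte buffer the same way, assuming a 64-bit little-endian host and the libstdc++ __cxx11 small-string layout described above. `build_sso_string` is a hypothetical helper, not the host's code.

```python
import ctypes
import struct


def build_sso_string(text: bytes) -> ctypes.Array:
    """Lay out a libstdc++ __cxx11 std::string in small-string (SSO) form.

    Offsets: 0 = data pointer, 8 = length, 16 = 16-byte inline buffer.
    Assumption: 64-bit little-endian host, as on Linux x86_64.
    """
    if len(text) > 15:
        raise ValueError("SSO holds at most 15 bytes plus the terminator")
    storage = ctypes.create_string_buffer(32)  # zero-filled 32-byte region
    base = ctypes.addressof(storage)
    # The data pointer refers back into the object's own inline buffer.
    # "16s" pads the text with zero bytes, which supplies the terminator.
    struct.pack_into("<QQ16s", storage, 0, base + 16, len(text), text)
    return storage


if __name__ == "__main__":
    s = build_sso_string(b"MLIR_AIE")
    data_ptr, length = struct.unpack_from("<QQ", s, 0)
    print(f"length = {length}, inline = {data_ptr == ctypes.addressof(s) + 16}")
    print(ctypes.string_at(data_ptr))
```

Because the data pointer is self-referential, the storage must not be copied or moved after it is filled in. That is also why the host passes the address of the stack slot rather than its contents.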

Pimpl Lifecycle: Phase 1 vs Phase 2

Today the host calls destructors explicitly, in a series of if *Alive then *Destruct … blocks at the end of main. This is Phase 1: hand-placed lifecycle, written once and reviewed against the construction order.

Phase 2 moves the placement into the compiler. The PSG's escape analysis already classifies allocating sites into StackScoped, EscapesViaReturn, EscapesViaClosure, and EscapesViaByRef. A new PimplLifecycle coeffect and a pimpl-lifecycle-lowering MLIR plugin will let the witness emit the destructor func.call at the correct scope boundary automatically. The design analysis is in:

docs/cpp-pimpl-lifetime-design.md — ABI verification, coeffect carrier design, ExternCall reuse, byref safety
docs/psg-destructor-lifetime-inference.md — how escape kinds map to destructor placement
docs/farscape-pilot-cpp-navigation.md — classifying mixed C/C++ driver libraries and routing OnClass through the Pilot
docs/plan-farscape-sret-detection.md — detecting the SysV sret convention for value-returning C++ methods

Once Phase 2 lands, src/XrtCppBindings.clef and the manual cleanup block in src/Program.clef become a regression test for whatever Farscape generates.

Build

Clef source (host)                       Clef source (kernel)
  -> CCS front-end                         -> CCS front-end
  -> Composer (CPU lowering)               -> Composer (AIE backend)
  -> standard MLIR -> LLVM IR              -> aie_prj / MLIR_AIE
  -> clang -> ELF + libxrt_coreutil link   -> Vitis / Peano -> .xclbin + insts.bin

# Build the NPU kernel (produces HelloNappyKernel.xclbin + HelloNappyKernel_insts.bin)
cd kernel
/path/to/Composer compile HelloNappyKernel.fidproj

# Build the host
cd ..
/path/to/Composer compile HelloNappy.fidproj

# Stage both kernel artifacts next to the host executable, then run
cp kernel/targets/HelloNappyKernel.xclbin       targets/
cp kernel/targets/HelloNappyKernel_insts.bin    targets/
cd targets
./HelloNappy

libxrt_coreutil.so is loaded at runtime via dlopen/dlsym — FidelityExtern bindings with library != "c" no longer need -l flags, so the host fidproj has no [link] section.

Runtime Contract

Inputs (host-generated, 64 int32 samples each):
- Signal A: triangular wave, period 32, peak ±100
- Signal B: square wave, period 16, amplitude ±1
Output: A[i] * B[i] computed on the NPU across 4 AIE tiles (16 elements per tile).
Display: all three signals printed as [i] = value lines on stdout. Exit code 0 on success, 1 on any failure with a FAILED: … diagnostic.
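For checking NPU output by hand, a CPU reference of the contract above can be sketched in Python. The period, amplitude, and sample count come from the contract. The phase of each wave, the integer rounding of the triangle, and the helper names (`triangle`, `square`, `make_signals`) are assumptions made for this sketch and may differ from what the host generates.

```python
SAMPLES = 64


def triangle(i: int, period: int = 32, peak: int = 100) -> int:
    """Integer triangular wave: rises from -peak to +peak, then falls back.

    Assumption: starts at -peak and uses floor division for intermediate steps.
    """
    half = period // 2
    p = i % period
    if p < half:
        return -peak + (2 * peak * p) // half
    return peak - (2 * peak * (p - half)) // half


def square(i: int, period: int = 16) -> int:
    """Square wave with amplitude 1. Assumption: the first half-period is +1."""
    return 1 if (i % period) < period // 2 else -1


def make_signals():
    """Return signal A, signal B, and the element-wise product A[i] * B[i]."""
    a = [triangle(i) for i in range(SAMPLES)]
    b = [square(i) for i in range(SAMPLES)]
    out = [x * y for x, y in zip(a, b)]
    return a, b, out


if __name__ == "__main__":
    a, b, out = make_signals()
    for i in range(4):
        print(f"[{i}] = {out[i]}")
```

Comparing the NPU result against `out` element by element is a quick way to tell a data-movement problem, such as a whole tile's worth of wrong values, from an arithmetic one.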

Target Hardware

AMD Strix Halo with XDNA2 NPU
Linux x86_64 with the amdxdna driver loaded
XRT 2.21.0 (libxrt_coreutil.so discoverable via LD_LIBRARY_PATH or ldconfig)
An xclbin produced by the Composer AIE backend; reference build artifacts live in kernel/targets/ and targets/.
