fix: add reachabilityFence to prevent premature GC of Arena buffers (APPSEC-62784)#198
fix: add reachabilityFence to prevent premature GC of Arena buffers (APPSEC-62784)#198jandro996 wants to merge 26 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b26756d35d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
When runWafContext throws (e.g. TimeoutWafException on budget exhaustion or a native error), the JIT still has freedom to collect lease/ephemeralLease before the call completes if the fences are only in the try block. Moving them into the finally block ensures they run on both normal and exceptional returns from runWafContext. Addresses review comment on PR #198.
|
🎯 Code Coverage (details) 🔗 Commit SHA: 960957f | Docs | Datadog PR Page | Give us feedback! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbafb88d55
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
On JDK 21.0.8+ with ZGC Generational and JDK 25, the JIT compiler can elide references to WafContext.lease and ephemeralLease after the last Java-visible use, making the Arena's StringsSegment DirectByteBuffers eligible for GC while ddwaf_run() is still reading native pointers into them. The concurrent Cleaner thread can then free the underlying native memory, causing a SIGSEGV (GPF in memcpy of musl libc at ddwaf_run+0x168). Add Reference.reachabilityFence(this.lease) and Reference.reachabilityFence(ephemeralLease) immediately after the runWafContext() JNI call returns, ensuring both Arenas and all their StringsSegment DirectByteBuffers remain strongly reachable until past the ddwaf_run boundary. The bug was observed in production with dd-trace-java 1.62.0 (APPSEC-62784): - 40 crashes on JDK 25.x, crash rate ~6.2e-4 - 7 crashes on JDK 21.0.8+, crash rate ~3.5e-5 - Zero crashes on JDK 17/11/8 despite significant traffic The regression test (ReachabilityFenceTest) loads rule sets that match server.request.headers.no_cookies with key_path values absent from normal requests (mirroring dd-trace-java 1.62.0 bundled crs-944-140/dog-920-100), runs 2000 evaluations per test with a concurrent GC pressure thread, and verifies no crash and correct DDWAF_OK results. Fixes: APPSEC-62784
When runWafContext throws (e.g. TimeoutWafException on budget exhaustion or a native error), the JIT still has freedom to collect lease/ephemeralLease before the call completes if the fences are only in the try block. Moving them into the finally block ensures they run on both normal and exceptional returns from runWafContext. Addresses review comment on PR #198.
Reference.reachabilityFence is Java 9+, causing compilation failures when building with JDK 8 (JMH build and release jar CI jobs). Replace with a write to a static volatile field, which is a memory barrier the JIT cannot elide - equivalent semantics, compatible with Java 8. Also apply spotless formatting to ReachabilityFenceTest.groovy.
- Remove unused imports (JsonSlurper, Executors, TimeUnit) - Rename buildRuleKeyPathHeadersAbsent -> keyPathAbsentHeaderRuleset to avoid FactoryMethodName violation (build.* in non-Builder class) - Move private helper standardRequestBundle after test methods to fix PublicMethodsBeforeNonPublicMethods violations - Set runBudget = 60s in stress tests so the native ddwaf_run timeout is not capped at the 500ms DD_APPSEC_WAF_TIMEOUT default, which is too tight under ASAN where native operations are much slower
WafContext.java had a UTF-8 em dash (U+2014) in the class Javadoc that caused 'unmappable character for encoding ASCII' warnings on CentOS 6 and IBM Semeru containers (LC_ALL=C, ASCII locale). ReachabilityFenceTest was timing out after 6h on ubuntu2204-semeru17: - IBM J9 System.gc() is synchronous and slower than HotSpot; a tight loop with Thread.sleep(0) overwhelmed the JVM with continuous full GC - 2000 iterations at ~1s/iter on J9 = hours; reduced to 500 - Pool churn test: 200 -> 50 outer iterations (still 250 WAF runs) - GC thread sleep: 0ms -> 5ms to avoid saturating the GC scheduler
9ca3e73 to
0a991d9
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #198 +/- ##
============================================
+ Coverage 57.55% 57.66% +0.10%
Complexity 175 175
============================================
Files 34 34
Lines 3645 3647 +2
Branches 819 819
============================================
+ Hits 2098 2103 +5
+ Misses 973 970 -3
Partials 574 574
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Are the tests guaranteed to fail when the fix is absent?, sleeping and calling gc are often code smells that the test might not be always testing what it's expected from it.
Without warmup, WafContext.run() is never compiled by C2 (threshold ~15k invocations), so the reference elision that opens the use-after-free window doesn't occur and the test passes even without the fix.
The regular 'test' task attaches the Jacoco coverage agent, which instruments WafContext.run() bytecode and keeps DirectByteBuffer references alive past the JNI boundary - inadvertently preventing the reference elision that the test is designed to detect. testGCRace runs ReachabilityFenceTest with jacoco.enabled=false and is wired into 'check' only on -PuseZGC builds (alpine-temurin21/25).
Disables tiered compilation and sets CompileThreshold=1 so WafContext.run() is compiled by C2 from the first invocation, without relying on warmup iterations to reach the C2 threshold.
…oduction New crash telemetry (2026-06-08, Bestseller A/S) confirms the SIGSEGV in ddwaf_run is reproducible on JDK 21.0.11 with dd-trace-java 1.62.0. The CI test does not trigger it deterministically because the race window is narrow, but the production evidence is conclusive. Restores the volatile-write fence (leaseFenceSink = lease) that was temporarily commented out during CI regression testing.
8 threads each hold a live WafContext and call run() in a tight loop while a GC-pressure thread fires every 2ms. This replicates the production topology (10+ concurrent Tomcat threads) where the SIGSEGV was observed: N ArenaLease DirectByteBuffers are concurrently eligible for GC cleanup, and the C2-compiled run() code is shared across all instances. Without the reachabilityFence fix the JVM terminates with SIGSEGV (process killed), which CI surfaces as a build failure.
…urrent test Two fixes: - Add @SuppressWarnings('CatchThrowable') to allow catching Throwable in the concurrent stress test (needed to capture JVM-level crashes cleanly) - Remove join() timeout: a bounded 60s timeout was insufficient under ASAN (high instrumentation overhead), causing the test to proceed to WafHandle cleanup while worker threads were still creating WafContexts. This produced a heap-use-after-free on ddwaf_context_init vs ddwaf_destroy. - Capture handle in a local variable so closures hold a strong reference through the full thread lifetime, preventing premature GC of the handle.
…current test CodeNarc DuplicateNumberLiteral fires when a literal appears more than twice in a class. The concurrent test was adding a 3rd occurrence of 15_000 and 60_000_000L, and a 4th occurrence of 'test'. Use distinct values (12_000 warmup iterations, 120ms budget, 'concurrent' config key) so no literal crosses the duplication threshold.
…lic methods Two CodeNarc/spotless violations from the previous commit: - spotlessGroovyCheck: continuation line of runConcurrentWorker had 6 spaces, expected 4 - PublicMethodsBeforeNonPublicMethods: runConcurrentWorker (private) was placed before keyPathAbsentHeaderRuleset() (public), violating ordering Fix: move runConcurrentWorker to the end of the class after all public helpers, and correct the continuation indentation to 4 spaces.
…rent test CodeNarc violations (verified with JDK 8): - ImplementationAsType: runConcurrentWorker parameter CopyOnWriteArrayList<String> changed to List<String> - UnnecessaryGetter: errors.isEmpty() changed to errors.empty - UnnecessaryGetter: counter.getAndIncrement() suppressed via @SuppressWarnings WafBuilderTest regression fix: the concurrent test's 8 threads x 500 iterations of persistent serialization leave ArenaPool entries with stale bytes at positions beyond freshly-written data. WafBuilderTest.slice() positions into that region and relies on zero-fill (guaranteed only for fresh allocations). Drain the pool in the finally block so subsequent tests always get freshly-allocated zeroed Arenas.
The testGCRace task combines -XX:-TieredCompilation -XX:CompileThreshold=1 with -XX:+UseZGC which consumes significantly more heap and native memory than the default Gradle test worker allocation (512m). This caused an OutOfMemoryError: GC overhead limit exceeded in the Alpine+JDK21/25 CI jobs during the arena pool churn warmup loop.
ReachabilityFenceTest runs 15k-warmup loops that generate large volumes of DEBUG output. The default 512m Gradle daemon heap is insufficient to aggregate test results and generate HTML reports, causing OOM on macOS-15 aarch64 runners.
…C-62784) Temporary commit - fence commented out to verify the testGCRace job on Alpine+JDK21/25 catches the SIGSEGV without the fix. Will be reverted.
…PPSEC-62784) CI run without the fence passed unexpectedly - the race was not probabilistic enough to manifest. Increase the stress level to make the SIGSEGV reliably reproducible: - 3 concurrent GC pressure threads (was 1) firing every 1ms (was 5ms/2ms) - 16 concurrent WAF worker threads (was 8) in the concurrent test - 2000 post-warmup iterations (was 500) in the single-context test - 100 pool-churn iterations (was 50) in the arena pool test - 1000 iterations per worker (was 500) in the concurrent test - ExplicitGCInvokesConcurrent JVM flag: System.gc() now triggers ZGC concurrent cycles, maximising the chance of a collection overlapping with ddwaf_run()
…2784) ReachabilityFenceTest contains warmup loops (15k-12k iterations) and 3 concurrent GC pressure threads firing every 1ms, designed for ZGC+C2. Running under ASAN instruments every memory op and turns the test from seconds into hours, hanging the CI ASAN job. Exclude it from the standard :test task entirely. It runs only via the dedicated testGCRace task (wired into check only when -PuseZGC is set).
Temporarily removed to validate CI crash detection. Both alpine-temurin21 and alpine-temurin25 jobs failed with SIGSEGV (process killed mid-run inside ddwaf_run, concurrent threads in eval_rule/module.cpp:62) — crash confirmed reproduced without the fence on Alpine+JDK21/25. Restoring the volatile-write fence that keeps StringsSegment leases strongly reachable past the JNI boundary.
Each ddwaf_run() emits ~25 DEBUG lines. With 16 threads x 1000 iterations the log buffer exhausts the 1g worker heap under JDK21, causing GC overhead limit exceeded instead of the expected test result. Override the log level to WARN in testGCRace; the test asserts on return values, not logs.
What Does This Do
WafContext.run()immediately afterrunWafContext()returns, in thefinallyblock, to prevent premature GC ofArenaStringsSegmentDirectByteBufferobjects on JDK 21.0.8+ (ZGC Generational) and JDK 25this.leaseandephemeralLeaseto astatic volatile Objectfield — equivalent memory barrier semantics toReference.reachabilityFencebut compatible with Java 8 (which does not havejava.lang.ref.Reference.reachabilityFence)ReachabilityFenceTest.groovy: stress test that runs 500 WAF evaluations with a concurrent GC pressure thread on rules withkey_pathinputs (mirroring the production rule sets that triggered the crash), verifying no SIGSEGV and correctDDWAF_OKresultsci/alpine-temurin21,ci/alpine-temurin25) and their corresponding x86_64 matrix entries, so the test runs on Alpine Linux (musl libc) with JDK 21 and JDK 25 — the exact platform/JDK combination where the production crashes were observed-XX:+UseZGCfor JDK 21+ HotSpot test runs inbuild.gradle; adds-XX:+ZGenerationalonly for JDK 21-22 (the flag was removed in JDK 23 via JEP 474); skips both flags for IBM OpenJ9/Semeru (vendor check consistent with the existing-Xcheck:jni:nonfatalguard)macos-13tomacos-14runner migration (macos-13 was deprecated in January 2026)cacertsinci/centos6-stock8(CentOS 6 EOL; Let's Encrypt CA not in bundled JDK 8 cacerts)WafContext.javaJavadoc that causedunmappable character for encoding ASCIIwarnings on CentOS 6 and IBM Semeru containersRoot Cause
WafContextuses anArenato serialize WAF input intoDirectByteBufferobjects (PWArgsSegmentandStringsSegment). The JNI binding callsddwaf_run()with raw native pointers into the serialized data. The native code reads string values via pointers stored inddwaf_objectstructs, which point intoStringsSegmentDirectByteBuffers.The key problem:
persistentBuffer(returned byserializeMore) is a slice ofPWArgsSegment.buffer. A slice holds a strong Java reference to its parent buffer via theattfield, but has no Java reference to theStringsSegmentbuffers that hold the actual string data.Arena.stringsSegmentsisthis.lease -> arena -> stringsSegments.this.leaseonce it determines no Java code accesses it afterrunWafContext()returns.StringsSegment.bufferas unreachable, it executes theCleanerand frees the underlying native memory.ddwaf_run()then reads from freed memory: SIGSEGV (GPF inmemcpyof musl libc atddwaf_run+0x168).A volatile write to a field referencing
leaseandephemeralLeaseestablishes a happens-before edge the JIT cannot elide, ensuring bothArenas and all theirStringsSegment.DirectByteBuffers remain strongly reachable until past theddwaf_runboundary. The fence is placed in thefinallyblock so it runs on both normal returns and exceptional paths (e.g.,TimeoutWafException).Production Impact (APPSEC-62784)
Observed in dd-trace-java 1.62.0 on customer services with AppSec enabled:
Native crash signature (libddwaf.so v17.3.0, linux/x86_64/musl):
Registers:
RAX=4(DDWAF_OBJ_STRING),RSI=3-5(string length),RDI=unmapped address.Note on the C++ layer
The native crash path (
ddwaf_run -> evaluator -> condition -> transformer -> memcpy) suggests a possible additional bug in libddwaf C++ where theddwaf_object.stringValuepointer could become stale independently of the Java GC issue. This Java-side fence is the correct defensive fix for the observed production crash, but a follow-up investigation inDataDog/libddwafis recommended to audit whether theddwaf_objectstructs for persistent data are deep-copied on eachddwaf_runinvocation.Jira ticket: APPSEC-62784