Skip to content

fix: add reachabilityFence to prevent premature GC of Arena buffers (APPSEC-62784)#198

Draft
jandro996 wants to merge 26 commits into
masterfrom
fix/reachability-fence-crash-jdk21-jdk25
Draft

fix: add reachabilityFence to prevent premature GC of Arena buffers (APPSEC-62784)#198
jandro996 wants to merge 26 commits into
masterfrom
fix/reachability-fence-crash-jdk21-jdk25

Conversation

@jandro996

@jandro996 jandro996 commented May 12, 2026

Copy link
Copy Markdown
Member

What Does This Do

  • Adds a volatile-write fence in WafContext.run() immediately after runWafContext() returns, in the finally block, to prevent premature GC of Arena StringsSegment DirectByteBuffer objects on JDK 21.0.8+ (ZGC Generational) and JDK 25
  • The fence writes this.lease and ephemeralLease to a static volatile Object field — equivalent memory barrier semantics to Reference.reachabilityFence but compatible with Java 8 (which does not have java.lang.ref.Reference.reachabilityFence)
  • Adds ReachabilityFenceTest.groovy: stress test that runs 500 WAF evaluations with a concurrent GC pressure thread on rules with key_path inputs (mirroring the production rule sets that triggered the crash), verifying no SIGSEGV and correct DDWAF_OK results
  • Adds two new CI Docker images (ci/alpine-temurin21, ci/alpine-temurin25) and their corresponding x86_64 matrix entries, so the test runs on Alpine Linux (musl libc) with JDK 21 and JDK 25 — the exact platform/JDK combination where the production crashes were observed
  • Enables -XX:+UseZGC for JDK 21+ HotSpot test runs in build.gradle; adds -XX:+ZGenerational only for JDK 21-22 (the flag was removed in JDK 23 via JEP 474); skips both flags for IBM OpenJ9/Semeru (vendor check consistent with the existing -Xcheck:jni:nonfatal guard)
  • Fixes macos-13 to macos-14 runner migration (macos-13 was deprecated in January 2026)
  • Fixes stale cacerts in ci/centos6-stock8 (CentOS 6 EOL; Let's Encrypt CA not in bundled JDK 8 cacerts)
  • Removes a UTF-8 em dash from WafContext.java Javadoc that caused unmappable character for encoding ASCII warnings on CentOS 6 and IBM Semeru containers

Root Cause

WafContext uses an Arena to serialize WAF input into DirectByteBuffer objects (PWArgsSegment and StringsSegment). The JNI binding calls ddwaf_run() with raw native pointers into the serialized data. The native code reads string values via pointers stored in ddwaf_object structs, which point into StringsSegment DirectByteBuffers.

The key problem:

  • persistentBuffer (returned by serializeMore) is a slice of PWArgsSegment.buffer. A slice holds a strong Java reference to its parent buffer via the att field, but has no Java reference to the StringsSegment buffers that hold the actual string data.
  • The only strong reference chain to Arena.stringsSegments is this.lease -> arena -> stringsSegments.
  • The JIT compiler in JDK 21.0.8+ (ZGC Generational) and JDK 25 can elide the reference to this.lease once it determines no Java code accesses it after runWafContext() returns.
  • When the concurrent GC sees StringsSegment.buffer as unreachable, it executes the Cleaner and frees the underlying native memory.
  • ddwaf_run() then reads from freed memory: SIGSEGV (GPF in memcpy of musl libc at ddwaf_run+0x168).

A volatile write to a field referencing lease and ephemeralLease establishes a happens-before edge the JIT cannot elide, ensuring both Arenas and all their StringsSegment.DirectByteBuffers remain strongly reachable until past the ddwaf_run boundary. The fence is placed in the finally block so it runs on both normal returns and exceptional paths (e.g., TimeoutWafException).

Production Impact (APPSEC-62784)

Observed in dd-trace-java 1.62.0 on customer services with AppSec enabled:

JDK Crashes (30d) Traffic (2h sample) Crash rate
JDK 25.x 40 64,415 events ~6.2e-4
JDK 21.0.8+ 7 197,337 events ~3.5e-5
JDK 17 0 82,039 events 0
JDK 11 0 39,863 events 0

Native crash signature (libddwaf.so v17.3.0, linux/x86_64/musl):

SIGSEGV TRAPNO=13 (GPF)
  ld-musl-x86_64.so.1 +0x4618a   <- memcpy, src=stale ptr, count=3-5 bytes
  libddwaf.so +0x17cef7           <- virtual dispatch to evaluator/transformer
  libddwaf.so +0x179ce3
  libddwaf.so +0x17bd54
  libddwaf.so +0x16ef39
  ddwaf_run+0x168
  Java_com_datadog_ddwaf_WafContext_runWafContext+0x2bb
  WAFModule$WAFDataCallback.doRunWaf(...)
  GatewayBridge.maybePublishRequestData(...)  <- every HTTP request

Registers: RAX=4 (DDWAF_OBJ_STRING), RSI=3-5 (string length), RDI=unmapped address.

Note on the C++ layer

The native crash path (ddwaf_run -> evaluator -> condition -> transformer -> memcpy) suggests a possible additional bug in libddwaf C++ where the ddwaf_object.stringValue pointer could become stale independently of the Java GC issue. This Java-side fence is the correct defensive fix for the observed production crash, but a follow-up investigation in DataDog/libddwaf is recommended to audit whether the ddwaf_object structs for persistent data are deep-copied on each ddwaf_run invocation.

Jira ticket: APPSEC-62784

@jandro996 jandro996 requested a review from a team as a code owner May 12, 2026 08:43
@jandro996 jandro996 marked this pull request as draft May 12, 2026 08:44

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b26756d35d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/main/java/com/datadog/ddwaf/WafContext.java Outdated
jandro996 added a commit that referenced this pull request May 12, 2026
When runWafContext throws (e.g. TimeoutWafException on budget exhaustion
or a native error), the JIT still has freedom to collect lease/ephemeralLease
before the call completes if the fences are only in the try block. Moving
them into the finally block ensures they run on both normal and exceptional
returns from runWafContext. Addresses review comment on PR #198.
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented May 13, 2026

Copy link
Copy Markdown

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 57.30% (+0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 960957f | Docs | Datadog PR Page | Give us feedback!

@jandro996 jandro996 requested a review from smola May 22, 2026 09:49
@jandro996 jandro996 marked this pull request as ready for review May 22, 2026 13:19

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbafb88d55

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread build.gradle Outdated
jandro996 added 5 commits May 25, 2026 12:09
On JDK 21.0.8+ with ZGC Generational and JDK 25, the JIT compiler can
elide references to WafContext.lease and ephemeralLease after the last
Java-visible use, making the Arena's StringsSegment DirectByteBuffers
eligible for GC while ddwaf_run() is still reading native pointers into
them. The concurrent Cleaner thread can then free the underlying native
memory, causing a SIGSEGV (GPF in memcpy of musl libc at ddwaf_run+0x168).

Add Reference.reachabilityFence(this.lease) and
Reference.reachabilityFence(ephemeralLease) immediately after the
runWafContext() JNI call returns, ensuring both Arenas and all their
StringsSegment DirectByteBuffers remain strongly reachable until past
the ddwaf_run boundary.

The bug was observed in production with dd-trace-java 1.62.0 (APPSEC-62784):
- 40 crashes on JDK 25.x, crash rate ~6.2e-4
- 7 crashes on JDK 21.0.8+, crash rate ~3.5e-5
- Zero crashes on JDK 17/11/8 despite significant traffic

The regression test (ReachabilityFenceTest) loads rule sets that match
server.request.headers.no_cookies with key_path values absent from normal
requests (mirroring dd-trace-java 1.62.0 bundled crs-944-140/dog-920-100),
runs 2000 evaluations per test with a concurrent GC pressure thread, and
verifies no crash and correct DDWAF_OK results.

Fixes: APPSEC-62784
When runWafContext throws (e.g. TimeoutWafException on budget exhaustion
or a native error), the JIT still has freedom to collect lease/ephemeralLease
before the call completes if the fences are only in the try block. Moving
them into the finally block ensures they run on both normal and exceptional
returns from runWafContext. Addresses review comment on PR #198.
Reference.reachabilityFence is Java 9+, causing compilation failures
when building with JDK 8 (JMH build and release jar CI jobs). Replace
with a write to a static volatile field, which is a memory barrier the
JIT cannot elide - equivalent semantics, compatible with Java 8.

Also apply spotless formatting to ReachabilityFenceTest.groovy.
- Remove unused imports (JsonSlurper, Executors, TimeUnit)
- Rename buildRuleKeyPathHeadersAbsent -> keyPathAbsentHeaderRuleset
  to avoid FactoryMethodName violation (build.* in non-Builder class)
- Move private helper standardRequestBundle after test methods to fix
  PublicMethodsBeforeNonPublicMethods violations
- Set runBudget = 60s in stress tests so the native ddwaf_run timeout
  is not capped at the 500ms DD_APPSEC_WAF_TIMEOUT default, which is
  too tight under ASAN where native operations are much slower
WafContext.java had a UTF-8 em dash (U+2014) in the class Javadoc that
caused 'unmappable character for encoding ASCII' warnings on CentOS 6
and IBM Semeru containers (LC_ALL=C, ASCII locale).

ReachabilityFenceTest was timing out after 6h on ubuntu2204-semeru17:
- IBM J9 System.gc() is synchronous and slower than HotSpot; a tight
  loop with Thread.sleep(0) overwhelmed the JVM with continuous full GC
- 2000 iterations at ~1s/iter on J9 = hours; reduced to 500
- Pool churn test: 200 -> 50 outer iterations (still 250 WAF runs)
- GC thread sleep: 0ms -> 5ms to avoid saturating the GC scheduler
@jandro996 jandro996 force-pushed the fix/reachability-fence-crash-jdk21-jdk25 branch from 9ca3e73 to 0a991d9 Compare May 25, 2026 10:09
@codecov-commenter

codecov-commenter commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.66%. Comparing base (396b24f) to head (f282010).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #198      +/-   ##
============================================
+ Coverage     57.55%   57.66%   +0.10%     
  Complexity      175      175              
============================================
  Files            34       34              
  Lines          3645     3647       +2     
  Branches        819      819              
============================================
+ Hits           2098     2103       +5     
+ Misses          973      970       -3     
  Partials        574      574              
Flag Coverage Δ
helper 57.66% <100.00%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the tests guaranteed to fail when the fix is absent?, sleeping and calling gc are often code smells that the test might not be always testing what it's expected from it.

@jandro996 jandro996 marked this pull request as draft June 6, 2026 18:55
jandro996 added 9 commits June 6, 2026 21:32
Without warmup, WafContext.run() is never compiled by C2 (threshold ~15k
invocations), so the reference elision that opens the use-after-free window
doesn't occur and the test passes even without the fix.
The regular 'test' task attaches the Jacoco coverage agent, which instruments
WafContext.run() bytecode and keeps DirectByteBuffer references alive past the
JNI boundary - inadvertently preventing the reference elision that the test is
designed to detect. testGCRace runs ReachabilityFenceTest with jacoco.enabled=false
and is wired into 'check' only on -PuseZGC builds (alpine-temurin21/25).
Disables tiered compilation and sets CompileThreshold=1 so WafContext.run()
is compiled by C2 from the first invocation, without relying on warmup
iterations to reach the C2 threshold.
…oduction

New crash telemetry (2026-06-08, Bestseller A/S) confirms the SIGSEGV
in ddwaf_run is reproducible on JDK 21.0.11 with dd-trace-java 1.62.0.
The CI test does not trigger it deterministically because the race window
is narrow, but the production evidence is conclusive.

Restores the volatile-write fence (leaseFenceSink = lease) that was
temporarily commented out during CI regression testing.
8 threads each hold a live WafContext and call run() in a tight loop
while a GC-pressure thread fires every 2ms. This replicates the
production topology (10+ concurrent Tomcat threads) where the SIGSEGV
was observed: N ArenaLease DirectByteBuffers are concurrently eligible
for GC cleanup, and the C2-compiled run() code is shared across all
instances.

Without the reachabilityFence fix the JVM terminates with SIGSEGV
(process killed), which CI surfaces as a build failure.
…urrent test

Two fixes:
- Add @SuppressWarnings('CatchThrowable') to allow catching Throwable in the
  concurrent stress test (needed to capture JVM-level crashes cleanly)
- Remove join() timeout: a bounded 60s timeout was insufficient under ASAN
  (high instrumentation overhead), causing the test to proceed to WafHandle
  cleanup while worker threads were still creating WafContexts. This produced
  a heap-use-after-free on ddwaf_context_init vs ddwaf_destroy.
- Capture handle in a local variable so closures hold a strong reference
  through the full thread lifetime, preventing premature GC of the handle.
…current test

CodeNarc DuplicateNumberLiteral fires when a literal appears more than
twice in a class. The concurrent test was adding a 3rd occurrence of
15_000 and 60_000_000L, and a 4th occurrence of 'test'.

Use distinct values (12_000 warmup iterations, 120ms budget, 'concurrent'
config key) so no literal crosses the duplication threshold.
…lic methods

Two CodeNarc/spotless violations from the previous commit:
- spotlessGroovyCheck: continuation line of runConcurrentWorker had 6
  spaces, expected 4
- PublicMethodsBeforeNonPublicMethods: runConcurrentWorker (private) was
  placed before keyPathAbsentHeaderRuleset() (public), violating ordering

Fix: move runConcurrentWorker to the end of the class after all public
helpers, and correct the continuation indentation to 4 spaces.
jandro996 added 11 commits June 8, 2026 16:36
…rent test

CodeNarc violations (verified with JDK 8):
- ImplementationAsType: runConcurrentWorker parameter CopyOnWriteArrayList<String>
  changed to List<String>
- UnnecessaryGetter: errors.isEmpty() changed to errors.empty
- UnnecessaryGetter: counter.getAndIncrement() suppressed via @SuppressWarnings

WafBuilderTest regression fix: the concurrent test's 8 threads x 500 iterations of
persistent serialization leave ArenaPool entries with stale bytes at positions beyond
freshly-written data. WafBuilderTest.slice() positions into that region and relies on
zero-fill (guaranteed only for fresh allocations). Drain the pool in the finally block
so subsequent tests always get freshly-allocated zeroed Arenas.
The testGCRace task combines -XX:-TieredCompilation -XX:CompileThreshold=1
with -XX:+UseZGC which consumes significantly more heap and native memory
than the default Gradle test worker allocation (512m). This caused an
OutOfMemoryError: GC overhead limit exceeded in the Alpine+JDK21/25 CI jobs
during the arena pool churn warmup loop.
ReachabilityFenceTest runs 15k-warmup loops that generate large volumes of
DEBUG output. The default 512m Gradle daemon heap is insufficient to
aggregate test results and generate HTML reports, causing OOM on macOS-15
aarch64 runners.
…C-62784)

Temporary commit - fence commented out to verify the testGCRace job on
Alpine+JDK21/25 catches the SIGSEGV without the fix. Will be reverted.
…PPSEC-62784)

CI run without the fence passed unexpectedly - the race was not probabilistic
enough to manifest. Increase the stress level to make the SIGSEGV reliably
reproducible:

- 3 concurrent GC pressure threads (was 1) firing every 1ms (was 5ms/2ms)
- 16 concurrent WAF worker threads (was 8) in the concurrent test
- 2000 post-warmup iterations (was 500) in the single-context test
- 100 pool-churn iterations (was 50) in the arena pool test
- 1000 iterations per worker (was 500) in the concurrent test
- ExplicitGCInvokesConcurrent JVM flag: System.gc() now triggers ZGC
  concurrent cycles, maximising the chance of a collection overlapping
  with ddwaf_run()
…2784)

ReachabilityFenceTest contains warmup loops (15k-12k iterations) and 3
concurrent GC pressure threads firing every 1ms, designed for ZGC+C2.
Running under ASAN instruments every memory op and turns the test from
seconds into hours, hanging the CI ASAN job.

Exclude it from the standard :test task entirely. It runs only via the
dedicated testGCRace task (wired into check only when -PuseZGC is set).
Temporarily removed to validate CI crash detection. Both alpine-temurin21
and alpine-temurin25 jobs failed with SIGSEGV (process killed mid-run
inside ddwaf_run, concurrent threads in eval_rule/module.cpp:62) — crash
confirmed reproduced without the fence on Alpine+JDK21/25.

Restoring the volatile-write fence that keeps StringsSegment leases
strongly reachable past the JNI boundary.
Each ddwaf_run() emits ~25 DEBUG lines. With 16 threads x 1000 iterations
the log buffer exhausts the 1g worker heap under JDK21, causing
GC overhead limit exceeded instead of the expected test result.

Override the log level to WARN in testGCRace; the test asserts on return
values, not logs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants