Skip to content

Defer Message Action Should Have an Attribute as a Backstop#4028

Merged
iancooper merged 13 commits into
masterfrom
error_examples
Feb 23, 2026
Merged

Defer Message Action Should Have an Attribute as a Backstop#4028
iancooper merged 13 commits into
masterfrom
error_examples

Conversation

@iancooper

Copy link
Copy Markdown
Member

Allows Defer Message Action to match the pattern of having a backstop attribute. Ensures that we pass the defer period to requeue, and overwrite the subscription configured requeue

iancooper and others added 13 commits February 23, 2026 09:59
Mark ADR 0051 (error-handling-examples) as Accepted and create
design-approved marker for spec 0021-Error-Examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sample intentionally throws on every 5th message. Without resetting
the unacceptable message count, the pump would shut down. Setting the
window to TimeSpan.Zero resets the count each pump cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Four-project sample demonstrating RejectMessageOnErrorAsync on RabbitMQ:
- Greetings library with handler that rejects every 5th message
- GreetingsSender with TimedMessageGenerator IHostedService
- GreetingsReceiverConsole with RmqSubscription and DLQ routing
- DlqConsole consuming rejected messages with rejection metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align DeferMessageAction with RejectMessageAction and DontAckAction by
adding standard exception constructors and a TimeSpan? Delay property.
This structural change prepares for the DeferMessageOnError backstop
handler without altering existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three-project sample demonstrating DeferMessageAction on Kafka:
- Handler throws DeferMessageAction directly (no attribute yet)
- ConcurrentDictionary tracks retries per message ID
- Every 3rd message defers twice then succeeds on 3rd attempt
- Unique groupId kafka-DeferOnError-Sample avoids consumer collisions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DeferMessageOnErrorAttribute decorates a handler method to catch
unhandled exceptions and convert them to DeferMessageAction with a
configurable delay. The delay flows from the attribute through
InitializerParams to the handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Async counterpart to DeferMessageOnErrorHandler. Catches unhandled
exceptions in async pipelines and converts them to DeferMessageAction
with a configurable delay from the attribute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three-project sample demonstrating DeferMessageAction on RabbitMQ:
- Handler throws DeferMessageAction directly (no attribute yet)
- ConcurrentDictionary tracks retries per message ID
- Every 3rd message defers twice then succeeds on 3rd attempt
- InMemorySchedulerFactory provides requeue delay support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies DeferMessageOnErrorHandler passes through transparently
when the inner handler completes without throwing an exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies DeferMessageOnErrorHandlerAsync passes through transparently
when the inner handler completes without throwing an exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies both sync and async attributes return correct handler types,
timing, step values, and InitializerParams containing the delay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both Proactor and Reactor now capture DeferMessageAction.Delay
when catching the exception and pass it to RequeueMessage, which
falls back to RequeueDelay when no delay is specified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tted)

Three-project sample demonstrating DontAckOnErrorAsync on Kafka:
- Handler uses [DontAckOnErrorAsync(step: 0)], throws every 5th message
- On Kafka, not committing the offset causes re-delivery on next poll
- Unique groupId kafka-DontAckOnError-Sample avoids consumer collisions
- README contrasts Kafka nack (no-op) with RabbitMQ nack (BasicNack)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@codescene-delta-analysis codescene-delta-analysis Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gates Passed
4 Quality Gates Passed

See analysis details in CodeScene

Quality Gate Profile: Clean Code Collective
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.

@iancooper iancooper merged commit 2b39fe7 into master Feb 23, 2026
7 of 8 checks passed
@claude

claude Bot commented Feb 23, 2026

Copy link
Copy Markdown

test comment from agent - please ignore

@claude

claude Bot commented Feb 23, 2026

Copy link
Copy Markdown

Pull Request: Fix Unreliable CI Acceptance Tests

Overview

This PR addresses the reliability issues in Brighter's CI acceptance tests for MessagingGateways, Inboxes, and Outboxes. The tests were exhibiting erratic behavior in GitHub Actions, often failing due to timing-related issues that were difficult to reproduce locally.

Problem

The CI build (.github/workflows/ci.yml) runs acceptance tests that require various middleware (Kafka, RabbitMQ, Redis, MQTT) and databases (PostgreSQL, MySQL, SQL Server, MongoDB, DynamoDB). These tests frequently failed in CI but worked locally, suggesting infrastructure timing issues rather than code bugs.

Root Causes

  1. Missing Service Health Checks: Many services lacked health checks, causing tests to start before services were ready
  2. Inadequate Retry Counts: Health checks had too few retries (3-5) for CI environment variability
  3. Kafka Blind Wait: A fixed 30-second sleep instead of active readiness verification
  4. Short Job Timeouts: 5-minute timeouts were insufficient for slower CI environments
  5. No Documentation: No record of known issues or improvement strategies

Solution

1. Comprehensive Service Health Checks

Added or improved health checks for all services in .github/workflows/ci.yml:

Service Health Check Method Retries Max Wait
Kafka kafka-broker-api-versions 15 ~150s
Zookeeper nc localhost 2181 check 10 ~100s
Schema Registry HTTP endpoint 10 ~100s
RabbitMQ rabbitmqctl node_health_check 10 ↑ ~100s
Redis redis-cli ping 10 ↑ ~100s
PostgreSQL pg_isready 10 ↑ ~100s
MySQL/MariaDB healthcheck.sh 10 ↑ ~100s
SQL Server sqlcmd query 10 ~100s
MongoDB mongosh ping 15 ↑ ~300s
MQTT mosquitto_sub 10 ~100s
DynamoDB HTTP endpoint 10 ~100s

(↑ indicates increased from previous value)

2. Active Kafka Readiness Verification

Before:

- name: Sleep to let Kafka spin up
  uses: jakejarvis/wait-action@master
  with:
    time: '30s'

After:

- name: Wait for Kafka to be ready
  run: |
    max_attempts=30
    while [ $attempt -lt $max_attempts ]; do
      if kafkacat -b localhost:9092 -L > /dev/null 2>&1; then
        echo "Kafka is ready!"
        break
      fi
      sleep 2
    done

Benefits:

  • Active verification instead of blind waiting
  • Can complete in <30s if Kafka is ready quickly
  • Can wait up to 60s if needed
  • Fails fast with clear error message
  • Better troubleshooting through logging

3. Increased Job Timeouts

All acceptance test jobs increased from 5 minutes → 8 minutes:

  • redis-ci, mqtt-ci, rabbitmq-ci
  • postgres-ci, sqlserver-ci, mysql-ci
  • dynamo-ci, localstack-ci, mongodb-ci
  • aws-ci, aws-scheduler-ci, azure-ci
  • sqlite-ci, gcp-ci

Note: kafka-ci already had 20 minutes (unchanged)

4. Comprehensive Documentation

Added two documentation files:

  1. docs/CI-Test-Reliability.md (171 lines)

    • Detailed analysis of issues
    • Explanation of all improvements
    • Recommendations for future work
    • Success metrics to track
  2. RELIABILITY-IMPROVEMENTS-SUMMARY.md (140 lines)

    • Executive summary
    • Quick reference tables
    • Validation and rollback plans
    • Next steps

Impact

Expected Improvements

Service Startup Reliability

  • Services verified ready before tests run
  • Longer retry windows accommodate CI variability

Kafka Test Reliability

  • Active readiness check replaces blind wait
  • Better handling of slow Kafka initialization

Test Execution Success

  • 8-minute timeouts prevent premature cancellation
  • Tests have adequate time to complete

Maintainability

  • Documentation preserves knowledge
  • Clear patterns for future improvements

Tests Expected to Improve

  • Kafka tests (13 marked [Trait("Fragile", "CI")])
  • RabbitMQ tests (6 marked fragile)
  • Database initialization tests
  • Message broker timing tests

Testing & Validation

Pre-merge Validation

✅ CI workflow YAML syntax validated
✅ All changes backward compatible
✅ No test or application code modified
✅ Changes isolated to infrastructure

Post-merge Monitoring

Recommended monitoring approach:

  1. Track Pass Rates: Monitor CI success over 10+ builds
  2. Measure Duration: Check if 8-minute timeouts are adequate
  3. Review Logs: Verify Kafka readiness checks succeed
  4. Re-enable Tests: Gradually remove "Fragile" trait from stable tests
  5. Optimize Timeouts: Adjust based on observed durations

Success Metrics

Metric Current Target
CI Pass Rate TBD >95%
False Positive Rate TBD <5%
Kafka Startup Time 30s (blind) 10-30s (actual)
Job Duration (P95) TBD <7 min

Rollback Plan

If issues arise:

  1. Revert the 4 commits from this PR
  2. All changes are in CI config and docs only
  3. No code changes require rollback
  4. Previous behavior fully restored

Future Work

Short Term

  • Monitor CI success rates
  • Remove "Fragile" trait from stable tests
  • Optimize timeout values based on data

Medium Term

  • Implement retry helpers in tests
  • Add test-level readiness checks
  • Use environment variables for CI-specific timeouts

Long Term

  • Improve test fixtures with proper initialization
  • Re-evaluate test parallelization restrictions
  • Create CI-specific test configuration profiles

Files Changed

.github/workflows/ci.yml             | 94 insertions(+), 26 deletions(-)
docs/CI-Test-Reliability.md          | 171 new file
RELIABILITY-IMPROVEMENTS-SUMMARY.md  | 140 new file

Total: 3 files changed, 379 insertions(+), 26 deletions(-)

Related Issues

  • [Chore] Unreliable Acceptance Tests - Original issue

Commits

  1. 6d99514 - Improve CI service health checks and remove fixed delays
  2. c4b8f5c - Increase CI job timeouts to accommodate slower test environments
  3. b3440bd - Add documentation for CI test reliability improvements
  4. b1dea5e - Add comprehensive summary of reliability improvements

Reviewer Notes

Key Changes to Review

  1. Health Check Additions: Verify health check commands are appropriate
  2. Kafka Readiness Script: Review the bash logic for correctness
  3. Timeout Values: Confirm 8 minutes is reasonable but not excessive
  4. Documentation: Ensure documentation is accurate and helpful

Questions to Consider

  • Are the health check retry counts adequate?
  • Is the Kafka readiness script robust enough?
  • Should any tests be re-enabled immediately?
  • Are there other services that need health checks?

Conclusion

This PR makes infrastructure-only changes to improve CI test reliability without modifying any test or application code. The changes are conservative, well-documented, and easily reversible. The improvements address root causes rather than symptoms, providing a foundation for long-term reliability.

@claude

claude Bot commented Feb 23, 2026

Copy link
Copy Markdown

Allows Defer Message Action to match the pattern of having a backstop attribute. Ensures that we pass the defer period to requeue, and overwrite the subscription configured requeue

@claude

claude Bot commented Feb 23, 2026

Copy link
Copy Markdown

PR Review: Defer Message Action Attribute Backstop

This PR adds a DeferMessageOnError[Async]Attribute backstop pattern to complement the existing RejectMessageOnError and DontAckOnError attributes, and extends DeferMessageAction to carry a custom per-throw delay. Overall the design is well-executed and follows established Brighter conventions. I have a few issues worth addressing before merging.


Core Framework Changes ✓

The changes to DeferMessageAction, Proactor.cs, and Reactor.cs are clean:

  • Capturing the DeferMessageAction instance (instead of a bool flag) is the right approach.
  • The delay ?? RequeueDelay fallback in RequeueMessage correctly preserves existing behaviour when no custom delay is specified.
  • The whitespace-only cleanup in RequeueMessage mixes structural and behavioural changes in a single diff (minor).

Bug: Backstop Handlers Swallow a Directly-Thrown DeferMessageAction

DeferMessageOnErrorHandler.cs and DeferMessageOnErrorHandlerAsync.cs

Both backstop handlers have:

catch (Exception ex)
{
    throw new DeferMessageAction(ex.Message, ex, _delayMilliseconds);
}

If a handler already throws DeferMessageAction directly (the pre-attribute pattern shown in the defer samples), the backstop will:

  1. Catch it
  2. Wrap it in a new DeferMessageAction — losing the delay the handler specified
  3. Use the attribute's _delayMilliseconds instead

This means mixing the two patterns gives silently incorrect behaviour. Suggest adding a pass-through:

catch (DeferMessageAction)
{
    throw; // preserve the caller's delay
}
catch (Exception ex)
{
    throw new DeferMessageAction(ex.Message, ex, _delayMilliseconds);
}

Bug: OperationCanceledException Is Treated as a Defer Trigger

Both backstop handlers have a bare catch (Exception ex) with no special handling for cancellation. If the host is shutting down and a CancellationToken fires mid-handler, the caught OperationCanceledException gets converted to a DeferMessageAction and the message is requeued. Cancellation should propagate normally:

catch (OperationCanceledException)
{
    throw;
}
catch (Exception ex)
{
    throw new DeferMessageAction(ex.Message, ex, _delayMilliseconds);
}

There is no test covering this scenario.


Sample Logic Is Fragile: count % DEFER_EVERY_NTH After Re-delivery

KafkaDeferOnError and RMQDeferOnError handler

var count = Interlocked.Increment(ref _messageCount); // monotonically increasing
var retryCount = s_retryTracker.AddOrUpdate(messageId, 0, (_, existing) => existing + 1);

if (count % DEFER_EVERY_NTH == 0 && retryCount < MAX_DEFERRALS)
{
    throw new DeferMessageAction();
}

count is assigned at receipt time, not tied to message identity. When a deferred message is re-delivered, it receives a new count value. Whether count % 3 == 0 holds again depends on how many other messages were processed in between — making the sample non-deterministic and the expected output in README.md (showing message #3 deferring twice then succeeding) potentially inaccurate.

Using retryCount alone to decide whether to defer is simpler and correct:

if (retryCount < MAX_DEFERRALS)
{
    throw new DeferMessageAction();
}

Minor: TimedMessageGenerator._iteration Is Not Thread-Safe

private void DoWork(object? state)
{
    _iteration++;   // Timer callback runs on thread pool

_iteration++ is not atomic. Interlocked.Increment is used correctly for _messageCount in the handler — use the same pattern here.


Minor: Redundant Id Assignment in Sender

In GreetingsSender/TimedMessageGenerator.cs, the GreetingEvent initialiser sets Id = Id.Random() explicitly, but GreetingEvent already calls base(Id.Random()) in its constructor — so the initialiser overwrites the constructor's Id with a second random value and the first is silently discarded. Drop Id = Id.Random() from the object initialiser.


Minor: Missing Convenience Constructor on DeferMessageAction

There is no DeferMessageAction(string? reason, int delayMilliseconds) overload (without an inner exception). To specify a delay and a reason from a clean throw site, callers must pass null explicitly:

throw new DeferMessageAction("retry later", null, 5000);

A two-parameter overload would be cleaner to use.


Observation: No Equivalent Integration Test for Reactor

The new integration test covers the Proactor (async pump). There is no equivalent for Reactor (sync pump), even though Reactor.cs received identical changes. Worth adding for symmetry.


Positive Notes

  • The attribute/handler structure mirrors RejectMessageOnError and DontAckOnError exactly — consistent and easy to follow.
  • The pump changes are minimal and surgical.
  • The sample READMEs are thorough and explain transport-specific differences (Kafka offset vs RabbitMQ nack).
  • The unacceptableMessageLimitWindow: TimeSpan.Zero addition on the existing Kafka DLQ samples is a welcome improvement.
  • Test coverage of attribute → handler wiring and delay propagation is good.

Summary: The two behavioural bugs (swallowing DeferMessageAction thrown directly from a handler, and converting OperationCanceledException to a defer) should be addressed before merge. The sample fragility and thread-safety issues are lower priority but worth fixing since samples are the primary learning resource for this feature.

DevJonny pushed a commit to DevJonny/Brighter that referenced this pull request Feb 28, 2026
…Command#4028)

* Accept ADR 0051 for error handling example applications

Mark ADR 0051 (error-handling-examples) as Accepted and create
design-approved marker for spec 0021-Error-Examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add unacceptableMessageLimitWindow to KafkaTaskQueueWithDLQ sample

The sample intentionally throws on every 5th message. Without resetting
the unacceptable message count, the pump would shut down. Setting the
window to TimeSpan.Zero resets the count each pump cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add RMQTaskQueueWithDLQ sample for RabbitMQ RejectOnError with DLQ

Four-project sample demonstrating RejectMessageOnErrorAsync on RabbitMQ:
- Greetings library with handler that rejects every 5th message
- GreetingsSender with TimedMessageGenerator IHostedService
- GreetingsReceiverConsole with RmqSubscription and DLQ routing
- DlqConsole consuming rejected messages with rejection metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* tidy: add constructors and Delay property to DeferMessageAction

Align DeferMessageAction with RejectMessageAction and DontAckAction by
adding standard exception constructors and a TimeSpan? Delay property.
This structural change prepares for the DeferMessageOnError backstop
handler without altering existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add KafkaDeferOnError sample for Kafka defer with retry counter

Three-project sample demonstrating DeferMessageAction on Kafka:
- Handler throws DeferMessageAction directly (no attribute yet)
- ConcurrentDictionary tracks retries per message ID
- Every 3rd message defers twice then succeeds on 3rd attempt
- Unique groupId kafka-DeferOnError-Sample avoids consumer collisions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add sync DeferMessageOnErrorHandler and attribute

DeferMessageOnErrorAttribute decorates a handler method to catch
unhandled exceptions and convert them to DeferMessageAction with a
configurable delay. The delay flows from the attribute through
InitializerParams to the handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add async DeferMessageOnErrorHandlerAsync and attribute

Async counterpart to DeferMessageOnErrorHandler. Catches unhandled
exceptions in async pipelines and converts them to DeferMessageAction
with a configurable delay from the attribute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add RMQDeferOnError sample for RabbitMQ defer with retry counter

Three-project sample demonstrating DeferMessageAction on RabbitMQ:
- Handler throws DeferMessageAction directly (no attribute yet)
- ConcurrentDictionary tracks retries per message ID
- Every 3rd message defers twice then succeeds on 3rd attempt
- InMemorySchedulerFactory provides requeue delay support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add sync defer handler success path characterization test

Verifies DeferMessageOnErrorHandler passes through transparently
when the inner handler completes without throwing an exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add async defer handler success path characterization test

Verifies DeferMessageOnErrorHandlerAsync passes through transparently
when the inner handler completes without throwing an exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add attribute configuration tests for DeferMessageOnError

Verifies both sync and async attributes return correct handler types,
timing, step values, and InitializerParams containing the delay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: pump extracts delay from DeferMessageAction and passes to requeue

Both Proactor and Reactor now capture DeferMessageAction.Delay
when catching the exception and pass it to RequeueMessage, which
falls back to RequeueDelay when no delay is specified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add KafkaDontAckOnError sample for Kafka nack (offset not committed)

Three-project sample demonstrating DontAckOnErrorAsync on Kafka:
- Handler uses [DontAckOnErrorAsync(step: 0)], throws every 5th message
- On Kafka, not committing the offset causes re-delivery on next poll
- Unique groupId kafka-DontAckOnError-Sample avoids consumer collisions
- README contrasts Kafka nack (no-op) with RabbitMQ nack (BasicNack)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant