op-node/derive: malformed span batch tx data propagates as unclassified error causing O(N×backoff) drain #19494

@sebastianst

Problem

channel_in_reader.go:118-124 returns errors from DeriveSpanBatch to the caller without classifying them:

case SpanBatchType:
    // ...
    batch.Batch, err = DeriveSpanBatch(batchData, cr.cfg.BlockTime, cr.cfg.Genesis.L2Time, cr.cfg.L2ChainID)
    if err != nil {
        return nil, err  // unclassified error propagates up
    }

DeriveSpanBatch calls RawSpanBatch.ToSpanBatch, which in turn calls derive() to reconstruct full transactions from the span batch encoding. This can fail with unclassified errors (plain fmt.Errorf/errors.New) when the batch data is malformed:

  • recoverV: "invalid tx type: %d" for unknown transaction types
  • fullTxs: stx.UnmarshalBinary failure, "tx to not enough", stx.convertToFullTx failure

These are all external input failures (bad data from the batcher), not logic errors. The one NewCriticalError in DeriveSpanBatch (for the *RawSpanBatch type assertion) is correctly classified as a logic error and should continue to propagate.

Why it matters

Unclassified errors reach PipelineDeriver.OnEvent's catch-all:

} else if err != nil {
    d.pipeline.log.Error("Derivation process error", "err", err)
    d.emitter.Emit(ctx, rollup.EngineTemporaryErrorEvent{Err: err})
}

This causes:

  1. Error-level log: misleading for operators, since this is bad batcher data, not an infrastructure failure.
  2. Backoff + retry: cr.nextBatchFn is still set and the bad batch has already been consumed, so each retry reads the next batch. A channel with N malformed span batches therefore requires N backoff cycles to drain.

A batcher with a valid key can exploit this by submitting channels full of span batches with deliberately malformed transaction data (e.g. an unknown tx type byte), causing O(N × backoff_duration) stall per channel.

Fix

Catch non-critical errors in ChannelInReader.NextBatch and treat them as a drop:

batch.Batch, err = DeriveSpanBatch(batchData, cr.cfg.BlockTime, cr.cfg.Genesis.L2Time, cr.cfg.L2ChainID)
if err != nil {
    if errors.Is(err, ErrCritical) {
        return nil, err // logic error, propagate
    }
    cr.log.Warn("dropping malformed span batch", "err", err)
    return nil, NotEnoughData
}

NotEnoughData causes immediate retry with no backoff, draining bad channels at full speed.

Parent issue

Part of #19491


Generated by Claude
