feat(cli): add /compress-fast command for no-LLM rule-based context compression #4892

Closed
ZijianZhang989 wants to merge 2 commits into main from feat/compress-fast

ZijianZhang989 commented Jun 9, 2026
Collaborator

What this PR does

Adds /compress-fast, a new slash command that compresses conversation context without any LLM side-query. It combines two rule-based steps: (1) forcing microcompaction to clear old tool results and media parts while keeping the most recent N, and (2) stripping thought parts from all model turns. The result is a significantly smaller history, typically freeing thousands of tokens, with no API latency.

A chat_compression checkpoint is written to JSONL so --resume works exactly as it does after /compress.
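The two rule-based steps described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `HistoryPart` and `HistoryTurn` shapes, the `CLEARED_PLACEHOLDER` text, and all function names are hypothetical, media parts are left out, and dropping model turns that end up empty is an assumption of the sketch.

```typescript
// Hypothetical, simplified history shapes; the real qwen-code types are richer.
type HistoryPart =
  | { text: string; thought?: boolean }
  | { functionResponse: { name: string; response: { output: string } } };

interface HistoryTurn {
  role: 'user' | 'model';
  parts: HistoryPart[];
}

// Assumed placeholder text; the real marker may differ.
const CLEARED_PLACEHOLDER = '[old tool result cleared]';

// Step 1: blank out every tool result except the most recent `keep`.
function clearOldToolResults(history: HistoryTurn[], keep: number): HistoryTurn[] {
  const total = history
    .flatMap((turn) => turn.parts)
    .filter((part) => 'functionResponse' in part).length;
  let seen = 0;
  return history.map((turn) => ({
    ...turn,
    parts: turn.parts.map((part) => {
      if (!('functionResponse' in part)) return part;
      seen += 1;
      // The last `keep` tool results (by position) are left untouched.
      if (seen > total - keep) return part;
      return {
        functionResponse: {
          ...part.functionResponse,
          response: { output: CLEARED_PLACEHOLDER },
        },
      };
    }),
  }));
}

// Step 2: remove thought parts from model turns, then drop turns left empty.
function stripThoughts(history: HistoryTurn[]): HistoryTurn[] {
  return history
    .map((turn) =>
      turn.role === 'model'
        ? {
            ...turn,
            parts: turn.parts.filter((part) => !('text' in part && part.thought === true)),
          }
        : turn,
    )
    .filter((turn) => turn.parts.length > 0);
}

// Both steps are pure functions, so no model call is involved at any point.
function compressFastSketch(history: HistoryTurn[], keep: number): HistoryTurn[] {
  return stripThoughts(clearOldToolResults(history, keep));
}
```

Because the tool-result turns are kept and only their payloads are replaced, the call/response pairing in the history stays intact, which is what the smoke test's "no tool_use_id errors" step checks for.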

Why it's needed

/compress relies on an LLM side-query (~2-5s, ~30K tokens) to summarise history. For local model deployments and users who just want quick space reclamation, this is too slow. /compress-fast is entirely rule-based: no API call, no token cost, instant feedback. The two commands are complementary: use /compress-fast when you need space right now, and /compress when you want semantic summary quality.

Resolves #4264.

Reviewer Test Plan

How to verify

# Unit tests
npx vitest run \
  packages/core/src/services/microcompaction/microcompact.test.ts \
  packages/core/src/core/geminiChat.test.ts \
  packages/cli/src/ui/commands/compressFastCommand.test.ts \
  packages/cli/src/services/BuiltinCommandLoader.test.ts

Manual smoke test in interactive mode:

npm run build && npm run start
# 1. Ask the model to read files and use tools:
> Take a look at package.json and tsconfig.json for me
# 2. Run the fast compress:
/compress-fast
#    → COMPRESSION card shows before/after token counts
# 3. Run again immediately:
/compress-fast
#    → "No compression needed" (nothing left to clean)
# 4. Verify model still works:
> Which files did we just read?
#    → Model responds normally, no tool_use_id errors
# 5. Verify context preserved:
> What did we talk about before this message?
#    → Model remembers conversation structure (dialogue skeleton intact)
# 6. Verify existing /compress still works:
/compress
#    → LLM compression runs as before
# 7. Verify resume:
qwen-code --resume
#    → Session restores and model responds to follow-ups

Evidence (Before & After)

TUI change: a new COMPRESSION history item appears after running /compress-fast, showing the token reduction (e.g. 15,432 → 8,210). This is identical UX to /compress.

Non-UI artifacts: the JSONL transcript gains a chat_compression record with compressionStatus: COMPRESSED and triggerReason: manual, matching the /compress checkpoint format.
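As a rough illustration of what such a checkpoint line could look like and how a resume path might locate it, here is a hedged sketch. Only the `chat_compression` record type and the `COMPRESSED` / `manual` values come from the description above; the `NOOP` and `auto` values, the token-count field names, and both helper functions are assumptions made for the example.

```typescript
// Hypothetical record shape. Only `chat_compression`, `COMPRESSED`, and
// `manual` are taken from the PR description; everything else is assumed.
interface ChatCompressionRecord {
  type: 'chat_compression';
  compressionStatus: 'COMPRESSED' | 'NOOP';
  triggerReason: 'manual' | 'auto';
  originalTokenCount: number; // assumed field name
  newTokenCount: number; // assumed field name
}

// JSONL stores one JSON document per line, each terminated by a newline.
function toJsonlLine(record: ChatCompressionRecord): string {
  return JSON.stringify(record) + '\n';
}

// A resume path could scan the transcript backwards for the latest checkpoint.
function findLastCheckpoint(jsonl: string): ChatCompressionRecord | undefined {
  const lines = jsonl.split('\n').filter((line) => line.trim() !== '');
  for (const line of [...lines].reverse()) {
    const parsed = JSON.parse(line) as Partial<ChatCompressionRecord>;
    if (parsed.type === 'chat_compression') {
      return parsed as ChatCompressionRecord;
    }
  }
  return undefined;
}
```

Writing the same record format for both commands means a reader like `findLastCheckpoint` does not need to know which command produced the checkpoint.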

Tested on

OS Status
🍏 macOS
🪟 Windows ⚠️
🐧 Linux ⚠️

Environment (optional)

Local: npm run dev on macOS, Node 22.

Risk & Scope

  • Main risk or tradeoff: Stripping thinking blocks discards the model's internal reasoning. For very long tool-use chains where the model refers back to its own earlier reasoning (for example, why it called a particular tool), this could degrade answer quality. In practice, text parts and tool results carry enough visible state; the original /compress is available if deeper summarization is needed.
  • Not validated / out of scope: Performance on extremely large histories (>100K tokens) — the token estimation is fast but estimateContentTokens may have edge cases. The command intentionally does NOT rebuild the session via startChat() — deferred tools survive, unlike a /clear. This is both a feature (fast, preserves state) and a limitation (does not reclaim system prompt tokens).
  • Breaking changes / migration notes: None. All changes are additive. microcompactHistory() gains an optional { force: true } parameter that existing callers don't pass. stripThoughtPartsFromContent remains module-private.

Linked Issues

Closes #4264

俊良 added 2 commits June 9, 2026 15:30
…ompression (#4264)

Introduce /compress-fast, a fast rule-based alternative to /compress
that skips the LLM side-query entirely. Two modes:

- --tool-calls (default): strips thinking blocks + force-runs
  microcompaction to clear old tool results and media. Reuses
  surgical FileReadCache disarm from the existing microcompaction
  path.

- --keep-last: keeps only the last user message and last model
  reply. Aggressively trims context and rebuilds via startChat().

Core changes:
- microcompact.ts: add { force: true } option to skip time trigger
- geminiChat.ts: export stripThoughtPartsFromContent, add
  compressFast() method
- client.ts: add tryCompressChatFast() with mode-aware
  post-processing and FileReadCache handling

Closes #4264
…ehavior

--keep-last had a role-detection bug (functionResponse mistaken for user
text), required slow startChat() session rebuild, and lost all deferred
tools. Users can achieve equivalent result with /clear. The single
tool-calls path better matches the 'fast' positioning.
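The role-detection pitfall mentioned in this commit message can be illustrated with a small sketch. It assumes a Gemini-style history in which tool results travel in `user`-role turns as `functionResponse` parts, so checking `role === 'user'` alone is not enough to find the last message the user actually typed. The types and function names below are hypothetical and are not taken from the removed --keep-last code.

```typescript
// Hypothetical, simplified shapes for a Gemini-style history.
interface TurnPart {
  text?: string;
  functionResponse?: { name: string; response: unknown };
}

interface ChatTurn {
  role: 'user' | 'model';
  parts: TurnPart[];
}

// A turn counts as user-typed text only if it has the user role,
// carries no functionResponse part, and contains some non-empty text.
function isUserTextTurn(turn: ChatTurn): boolean {
  return (
    turn.role === 'user' &&
    turn.parts.every((part) => part.functionResponse === undefined) &&
    turn.parts.some((part) => typeof part.text === 'string' && part.text.length > 0)
  );
}

// Index of the last user-typed turn, or -1 if there is none.
function findLastUserTextIndex(history: ChatTurn[]): number {
  let last = -1;
  history.forEach((turn, index) => {
    if (isUserTextTurn(turn)) last = index;
  });
  return last;
}
```

A naive "last turn with role user" lookup would return the tool-result turn in a history that ends with a tool call, which matches the bug described above.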
ZijianZhang989 changed the title from "feat(cli): add /compress-fast command for no-LLM rule-based context compression (2nd)" to "feat(cli): add /compress-fast command for no-LLM rule-based context compression" on Jun 9, 2026
@qwen-code-ci-bot
Collaborator

Thanks for the PR!

Template looks good ✓ — all required sections filled in, bilingual, linked issue.

On direction: /compress-fast fills a real gap. /compress is expensive (~2-5s, ~30K tokens side-query), and for users on local models or anyone wanting instant space reclamation, a rule-based alternative is genuinely useful. Claude Code has /context and "Summarize up to here" in its rewind menu, so context management is clearly a competitive surface. This is well-aligned with qwen-code's mission.

On approach: scope feels right. The two-step recipe (force microcompaction + strip thought parts) is minimal and composes well with existing primitives — microcompactHistory, stripThoughtPartsFromContent, recordChatCompression are all reused, not reinvented. The force: true option on microcompactHistory is a clean, backward-compatible extension. The command mirrors compressCommand.ts structure for familiarity. No over-engineering detected.

One observation worth noting (not a blocker): compressFastCommand.ts duplicates the interactive/non-interactive/ACP boilerplate from compressCommand.ts. Acceptable for clarity and independence, but worth knowing if a third compression variant ever appears.

Moving on to code review. 🔍

Qwen Code · qwen3.7-max

@qwen-code-ci-bot
Collaborator

Code Review

The implementation is clean and well-disciplined. Key observations:

  • compressFastCommand.ts mirrors compressCommand.ts structure exactly — interactive/non-interactive/ACP modes all handled, pending state managed, error paths covered. No surprises.
  • GeminiChat.compressFast() correctly reuses existing primitives: microcompactHistory({ force: true }), stripThoughtPartsFromContent, recordChatCompression, clearDetailedSpanState. The NOOP check (afterTokens >= beforeTokens) is sound.
  • GeminiClient.tryCompressChatFast() properly handles FileReadCache disarming using the same surgical pattern as tryCompressChat() — stat-based path eviction, fallback to full clear. Sets forceFullIdeContext = true and updates lastPromptTokenCount.
  • microcompactHistory force option is a minimal, backward-compatible extension — when force: true, it skips the time-based trigger but still respects toolResultsNumToKeep.
  • No security concerns. No over-abstraction. No wrong-package placement.
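Two of the points above, the optional `force` flag and the NOOP check, lend themselves to a short sketch. The version below is an assumption-laden illustration: the review only states that `force: true` skips a time-based trigger and that the NOOP condition is `afterTokens >= beforeTokens`. The idle-time threshold model, the option type, and both function names are hypothetical.

```typescript
// Hypothetical options type; existing callers simply omit the argument.
interface MicrocompactOptionsSketch {
  force?: boolean;
}

// Assumed trigger model: compaction normally fires only after the session
// has been idle for at least `thresholdMs`; `force` bypasses that check.
function shouldMicrocompact(
  idleMs: number,
  thresholdMs: number,
  options: MicrocompactOptionsSketch = {},
): boolean {
  if (options.force === true) return true;
  return idleMs >= thresholdMs;
}

type CompressionOutcome = 'COMPRESSED' | 'NOOP';

// NOOP check as quoted in the review: if the history did not get
// strictly smaller, report that nothing was compressed.
function classifyOutcome(beforeTokens: number, afterTokens: number): CompressionOutcome {
  return afterTokens >= beforeTokens ? 'NOOP' : 'COMPRESSED';
}
```

Defaulting `options` to an empty object is what keeps the extension backward compatible: a call without the third argument behaves exactly as it did before the flag existed.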

One minor observation: compressFastCommand.ts has some code duplication with compressCommand.ts (the pending/error boilerplate). Not a blocker — the two commands have meaningfully different internals (no abort signal, no custom instructions), and extracting shared helpers would add complexity for two consumers.

Unit Tests

All 223 tests pass across the four affected files:

  • microcompact.test.ts: 37/37 ✅ (including 4 new force option tests)
  • geminiChat.test.ts: 168/168 ✅ (including 4 new compressFast tests)
  • compressFastCommand.test.ts: 7/7 ✅ (new file)
  • BuiltinCommandLoader.test.ts: 11/11 ✅

Real-Scenario Testing (tmux)

Ran npm run dev -- --approval-mode yolo with qwen3.7-max. Tested the complete user journey:

  1. Started a conversation with tool use (ReadFile on package.json)
  2. Ran /compress-fast → compressed 22,055 → 2,868 tokens (87% reduction, instant)
  3. Ran /compress-fast again → "No compression needed."
  4. Asked follow-up question → model correctly remembered package.json
  5. Ran existing /compress → still works (22,063 → 21,206 tokens)

> Read the package.json file and tell me the project name

✦ The user wants me to read the package.json file and tell them the project name.

╭──────────────────────────────────────────────────────────────────────────────╮
│ ✓  ReadFile package.json                                                     │
╰──────────────────────────────────────────────────────────────────────────────╯

✦ The user asked for the project name from package.json. I have the file contents and can see the "name" field clearly.

✦ The project name is @qwen-code/qwen-code.

> /compress-fast

✦ Chat history compressed from 22055 to 2868 tokens.

> /compress-fast

● No compression needed.

> What file did we just read?

✦ The user asks which file we just read. I read the package.json file at the absolute path /home/runner/work/qwen-code/qwen-code/package.json. I should answer concisely.

✦ /home/runner/work/qwen-code/qwen-code/package.json

> /compress

✦ Chat history compressed from 22063 to 21206 tokens.

All scenarios PASS. The feature works as described — instant compression, correct NOOP detection, model continuity preserved, existing /compress unaffected.
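For reference, the reduction percentages follow directly from the before/after token counts in the transcript above. The helper below is a hypothetical convenience for checking that arithmetic and is not part of the PR.

```typescript
// Percentage of tokens removed, given counts before and after compression.
function reductionPercent(beforeTokens: number, afterTokens: number): number {
  if (beforeTokens <= 0) {
    throw new RangeError('beforeTokens must be positive');
  }
  return ((beforeTokens - afterTokens) / beforeTokens) * 100;
}

// /compress-fast run from the transcript: 22,055 -> 2,868 tokens.
const fastReduction = reductionPercent(22055, 2868).toFixed(1); // "87.0"

// /compress run from the transcript: 22,063 -> 21,206 tokens.
const llmReduction = reductionPercent(22063, 21206).toFixed(1); // "3.9"

console.log(`/compress-fast: ${fastReduction}% / /compress: ${llmReduction}%`);
```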

Qwen Code · qwen3.7-max

@qwen-code-ci-bot
Collaborator

Reflection

Stepping back: this PR does exactly what it says, and does it well.

My independent proposal would have been nearly identical — add a compressFast() method that strips thought parts and force-runs microcompaction, wire it to a /compress-fast command, record a checkpoint. The PR matches or exceeds that baseline. The force: true option on microcompactHistory is a cleaner extension point than I would have designed, and the FileReadCache disarming in tryCompressChatFast() shows attention to a detail I might have missed.

The real-scenario test tells the story: 22,055 → 2,868 tokens with no LLM round trip, the model still knows what we talked about, and the second run correctly reports "No compression needed." The existing /compress is untouched.

The code is straightforward — reuses existing primitives, mirrors established patterns, no speculative abstractions. The test coverage is thorough (unit tests for every new path, plus the tmux smoke test confirms end-to-end behavior). The only minor wart is boilerplate duplication between the two compress commands, which is acceptable for two consumers with different internals.

This solves a real user problem (context management is a top friction point), ships cleanly, and doesn't take on unnecessary risk. Ready to merge.

Qwen Code · qwen3.7-max

qwen-code-ci-bot left a comment
Collaborator

LGTM, looks ready to ship. ✅

Development

Successfully merging this pull request may close these issues.

Feature Requrest: /compress-fast non-AI assisted context reduction

2 participants