fix: restrict the grammar model's global reranking; promote only the best candidate to preserve candidate order #1176
Closed
amzxyz wants to merge 3 commits into
【Background and Pain Points】
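The grammar model (N-gram) was introduced to make long-sentence input more coherent and accurate. However, an N-gram model inevitably contains long-tail data, and its word segmentation cannot be as rigorous as a hand-curated dictionary. This causes a severe regression in user experience:
If the grammar model is allowed to rerank every candidate globally by context, then even when it correctly puts the best candidate first, the rest of the candidate list gets heavily reshuffled. This breaks the muscle memory and typing rhythm that users have built up over time for finding a word at a given page and position. It is also the root reason the community has long had to advise users to keep long-sentence prediction on and turn contextual reranking off.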
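【Rationale and Core Logic】
This PR aims to resolve that conflict at its root by strictly limiting contextual reranking to promoting only the best candidate, with all other candidates kept in their original order. I strongly recommend accepting this PR, for the following three reasons: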
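Follow the principle of minimal intrusion (protect the user experience)
Each input schema already ships with a mature default ordering. For users, the ordering differs from schema to schema, but the best candidate in a given context tends to be the same regardless. Sacrificing the stability of the whole remaining list, and disrupting the user's typing rhythm, in exchange for a smarter first candidate is not a worthwhile trade. This change makes the intervention precise: the model is responsible only for supplying the best first candidate, and the ordering of everything else is handed back to the dictionary and to the user's memory.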
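Acknowledge the limits of the model and of the data
Given input data that is extremely large and long-tailed, the model cannot guarantee a correct, sensible ordering across a large batch of low-frequency candidates. Since correctness across the whole list cannot be guaranteed, the underlying model should not be scoring the lower-ranked candidates at all, because that scoring adds no value and only introduces disorder.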
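Fix the worst-hit case: single-character input
Under the original global reranking mechanism, single-character candidates are where order inversions occur most. By comparison, long phrases (such as 2+2 structures) are constrained by their length and specific composition, so they are inherently more stable. Stopping global reranking therefore immediately fixes the severe misordering of single-character candidates and restores the single-character input experience.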
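To make the difference between the two behaviors concrete, here is a minimal Python sketch. It is not the code from this PR: the function names `rerank_globally` and `promote_best_only`, the `score` callback, and the sample scores are all hypothetical, chosen only to illustrate "promote the best candidate, keep the rest in original order".

```python
from typing import Callable, List, Sequence


def rerank_globally(candidates: Sequence[str],
                    score: Callable[[str], float]) -> List[str]:
    """Hypothetical global rerank: every candidate may move."""
    # sorted() is stable, so equal scores keep their dictionary order.
    return sorted(candidates, key=score, reverse=True)


def promote_best_only(candidates: Sequence[str],
                      score: Callable[[str], float]) -> List[str]:
    """Hypothetical restricted rerank: move only the top-scoring
    candidate to the front; all others keep their original order."""
    if not candidates:
        return []
    # max() returns the first maximum, so ties favor dictionary order.
    best = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    rest = [c for i, c in enumerate(candidates) if i != best]
    return [candidates[best]] + rest


# Illustrative data only: dictionary order and made-up model scores.
dictionary_order = ["a", "b", "c", "d"]
model_scores = {"a": 0.1, "b": 0.4, "c": 0.9, "d": 0.5}

global_result = rerank_globally(dictionary_order, model_scores.get)
restricted_result = promote_best_only(dictionary_order, model_scores.get)
```

With these made-up scores, both approaches agree on the first candidate (`c`), but the global rerank also reorders `a`, `b`, and `d`, while the restricted version leaves them exactly where the dictionary put them.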
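【Summary】
In short, having the grammar model do one job, nominating the single best candidate, and giving up any influence over the rest of the list is the best way to balance smarter long-sentence input against the stability of the base dictionary. This approach is technically feasible and keeps the user experience consistent, so I recommend merging this PR.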