
Zero token counts for empty transcripts and README alignment #43

Merged
gistrec merged 4 commits into main from codex/add-llm_tokens_by_model-column-to-transcriptionhistory
Mar 12, 2026
Conversation

@gistrec gistrec commented Jan 20, 2026

Motivation

  • Avoid counting tokens for the user-facing fallback text when a transcription is empty, and record zero token counts for empty transcriptions instead.
  • Keep the stored text user-friendly while basing token accounting on the raw recognition output.
  • Fix the README schema formatting so the llm_tokens_by_model column aligns with the other columns for readability.

Description

  • Update utils/tokens.py so tokens_by_model returns zeros for every model when text.strip() is empty, keeping the LLM_TOKEN_MODELS-based mapping.
  • Change schedulers/transcription.py to compute token_counts = tokens_by_model(raw_text) from the raw parse_text(result) output, and only then replace empty text with the friendly fallback string before persisting results.
  • Pass llm_tokens_by_model=token_counts to update_transcription on both the success and failure paths.
  • Adjust README.md spacing so the llm_tokens_by_model JSON column aligns with the other schema columns.
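The empty-text guard described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the model list is assumed, and a whitespace split stands in for whatever per-model tokenizer the project really uses.

```python
LLM_TOKEN_MODELS = ["gpt-4o", "gpt-3.5-turbo"]  # assumed model list

def tokens_by_model(text: str) -> dict[str, int]:
    """Return a token count for each model; all zeros for empty input."""
    # Empty or whitespace-only text maps every model to zero, so the
    # user-facing fallback string is never counted.
    if not text.strip():
        return {model: 0 for model in LLM_TOKEN_MODELS}
    # Stand-in tokenizer: whitespace split. The real implementation
    # presumably uses a per-model tokenizer library instead.
    n_tokens = len(text.split())
    return {model: n_tokens for model in LLM_TOKEN_MODELS}
```

Keeping the guard inside tokens_by_model means every caller gets the zero-count behavior without repeating the check.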

Testing

  • No automated tests were run for this change.
  • The modified files were checked by local static inspection and manual review before the changes were committed.

Codex Task

Comment thread: utils/tokens.py

```python
text = parse_text(result)
if not text.strip():
token_counts = tokens_by_model(text)
```

Bug: When an S3 upload fails during transcription, the call to update_transcription omits the llm_tokens_by_model parameter, preventing the token counts from being saved for the failed task.
Severity: MEDIUM

Suggested Fix

Modify the update_transcription call within the if s3_uri is None: block to include the llm_tokens_by_model=token_counts argument. This ensures token counts are persisted consistently across all failure scenarios, matching the other error-handling paths.
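The corrected ordering can be sketched as below. Everything here is a stub standing in for the real code in schedulers/transcription.py: FALLBACK_TEXT, the store dict, and the simplified tokens_by_model are all assumptions made so the fragment is self-contained.

```python
FALLBACK_TEXT = "Speech not recognized"  # assumed user-facing fallback

def tokens_by_model(text):
    # Stub matching the behavior described above: zeros for empty text.
    return {"gpt-4o": 0 if not text.strip() else len(text.split())}

def persist_result(raw_text, s3_uri, store):
    """Stub of the scheduler's persistence step (store is a plain dict)."""
    # Count tokens on the raw recognition output *before* substituting
    # the fallback string, so empty transcripts record zero tokens.
    token_counts = tokens_by_model(raw_text)
    text = raw_text if raw_text.strip() else FALLBACK_TEXT
    if s3_uri is None:
        # Suggested fix: pass token counts on the S3-failure path too,
        # instead of dropping them with the 'failed' status.
        store.update(status="failed", text=text,
                     llm_tokens_by_model=token_counts)
    else:
        store.update(status="done", text=text, s3_uri=s3_uri,
                     llm_tokens_by_model=token_counts)
    return store
```

With this shape, both branches persist the same token_counts value, so an S3 failure no longer loses the accounting data.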

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: schedulers/transcription.py#L90

Potential issue: In the transcription scheduler, token counts are calculated and stored
in the `token_counts` variable. If the subsequent S3 upload fails, `s3_uri` will be
`None`, triggering a failure path. In this specific failure case, the call to
`update_transcription` on line 102 omits the `llm_tokens_by_model=token_counts`
argument. This contradicts the logic in other success and failure paths where the token
counts are correctly passed. As a result, when a transcription fails due to an S3 upload
issue, the token count information for that task is lost instead of being persisted with
the 'failed' status.

@gistrec gistrec merged commit 5b896d6 into main Mar 12, 2026
1 check passed
@gistrec gistrec deleted the codex/add-llm_tokens_by_model-column-to-transcriptionhistory branch March 12, 2026 22:45
