What is the problem?
The GitHub CLI team is using `gh models` to help detect whether a newly created `cli/cli` issue is spammy, as part of cli/cli#11316.
To help evaluate whether our system prompts can accurately determine if an issue is spammy, a standalone `eval.sh` script was created to assess a number of scenarios we've seen in `cli/cli` issues.
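The script itself isn't included in this report; a minimal sketch of the shape it might take, reusing the documented `gh models eval` invocation (the prompt and output file names are illustrative):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Run every recorded scenario through the spam-detection prompt and
# keep the machine-readable results for scoring.
gh models eval --json spam-detection.prompt.yml > results.json
```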
When rate limits are reached, `gh models` does not gracefully handle the `429 Too Many Requests` response from the API:
```
Running evaluation: Evaluate spam detection
Description:
Model: openai/gpt-4o-mini
Test cases: 273
Running test case 1/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'PASS'
Running test case 2/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'PASS'
Running test case 3/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 4/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 5/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 6/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 7/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 8/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 9/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 10/273...
✗ FAILED
Model Response: PASS
✗ assert response (score: 0.00)
Expected exact match: 'FAIL'
Running test case 11/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 12/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 13/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 14/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 15/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 16/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 17/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 18/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 19/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 20/273...
✓ PASSED
✓ assert response (score: 1.00)
Expected exact match: 'FAIL'
Running test case 21/273...
Error: test case 21 failed: failed to call model: unexpected response from the server: 429 Too Many Requests
Too Many Requests
Usage:
gh models eval [flags]
Examples:
gh models eval my_prompt.prompt.yml
gh models eval --org my-org my_prompt.prompt.yml
Flags:
-h, --help help for eval
--json Output results in JSON format
--org string Organization to attribute usage to (omitting will attribute usage to the current actor
```
How might this be improved?
- Have `gh models eval` sleep an appropriate amount of time, based on the rate limit reset, and continue from the test case where the 429 response arose (see the first sketch below)
- Avoid printing the usage statement for errors unrelated to invalid arguments (see the second sketch below)
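For the first suggestion, a minimal sketch of a sleep-and-resume loop, assuming the eval loop can see the raw HTTP response. The header names (`Retry-After`, `X-RateLimit-Reset`) are common rate-limit conventions rather than confirmed Models API behavior, and `runTestCase` is a hypothetical stand-in for the real per-case model call:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// sleepFor inspects a 429 response and reports how long to wait before
// retrying. Retry-After carries seconds; X-RateLimit-Reset carries a
// Unix timestamp. Which headers the API actually sends is an assumption.
func sleepFor(resp *http.Response) (time.Duration, bool) {
	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
		return 0, false
	}
	if s := resp.Header.Get("Retry-After"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil {
			return time.Duration(secs) * time.Second, true
		}
	}
	if s := resp.Header.Get("X-RateLimit-Reset"); s != "" {
		if unix, err := strconv.ParseInt(s, 10, 64); err == nil {
			return time.Until(time.Unix(unix, 0)), true
		}
	}
	return 30 * time.Second, true // conservative fallback when no header is present
}

// runTestCase is a hypothetical stand-in for calling the model once.
func runTestCase(i int) (*http.Response, error) { return nil, nil }

func main() {
	const total = 273
	for i := 1; i <= total; i++ {
		for {
			resp, err := runTestCase(i)
			if wait, limited := sleepFor(resp); limited {
				fmt.Printf("Rate limited on test case %d/%d; sleeping %s\n", i, total, wait.Round(time.Second))
				time.Sleep(wait)
				continue // resume from the same test case instead of aborting the run
			}
			if err != nil {
				fmt.Printf("test case %d failed: %v\n", i, err)
			}
			break
		}
	}
}
```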
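For the second suggestion, `gh` extensions like this one are commonly built with spf13/cobra (an assumption here); a standard cobra idiom is to set `SilenceUsage` once flag parsing has succeeded, so that invalid arguments still print usage while runtime failures do not:

```go
package main

import (
	"errors"

	"github.com/spf13/cobra"
)

func newEvalCmd() *cobra.Command {
	return &cobra.Command{
		Use: "eval <prompt-file>",
		// PreRunE executes only after flags parse successfully, so
		// invalid arguments still trigger the usage text, while errors
		// returned from RunE (such as a 429) surface as plain errors.
		PreRunE: func(cmd *cobra.Command, args []string) error {
			cmd.SilenceUsage = true
			return nil
		},
		RunE: func(cmd *cobra.Command, args []string) error {
			return errors.New("unexpected response from the server: 429 Too Many Requests")
		},
	}
}

func main() {
	// cobra still prints "Error: ..." itself; only the usage block
	// is suppressed for errors raised after flag parsing.
	_ = newEvalCmd().Execute()
}
```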