I was looking at logs from job https://helix.dot.net/api/jobs/fb0d553f-40cc-48b8-8ff2-5e7ca6762430/workitems?api-version=2019-06-17 that ran on the ARM64 queue.
I found it surprising that each of these workitems took minutes to finish, even though I know most of these tests complete in seconds.
We're losing machine time somewhere I couldn't pinpoint where from the logs:
The console log for the System.Collections.Tests workitem has this:
Console log: 'System.Collections.Tests' from job fb0d553f-40cc-48b8-8ff2-5e7ca6762430 workitem 35fcc6c3-ed55-4c7f-9b23-953d667f7502 (windows.10.arm64.open) executed on machine DDARM64S-033
C:\h\w\B668097B\w\B5910958\e>taskkill.exe /f /im corerun.exe
ERROR: The process "corerun.exe" not found.
C:\h\w\B668097B\w\B5910958\e>call RunTests.cmd --runtime-path C:\h\w\B668097B\p
----- start Wed 07/27/2022 2:41:39.11 =============== To repro directly: =====================================================
pushd C:\h\w\B668097B\w\B5910958\e\
System.Collections.Tests.exe -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml
popd
===========================================================================================================
C:\h\w\B668097B\w\B5910958\e>System.Collections.Tests.exe -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml
/* Skipping for brevity */
Finished System.Collections.Tests, Version=7.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51
Tests run: 32511, Errors: 0, Failures: 0, Skipped: 109. Time: 5.1977535s
----- end Wed 07/27/2022 2:41:48.11 ----- exit code 0 ----------------------------------------------------------
2022-07-27T09:41:50.817Z INFO run.py run(48) main Beginning reading of test results.
2022-07-27T09:41:50.833Z INFO run.py __init__(42) read_results Searching 'C:\h\w\B668097B\w\B5910958\e' for test results files
2022-07-27T09:41:50.833Z INFO run.py __init__(48) read_results Found results file C:\h\w\B668097B\w\B5910958\e\testResults.xml with format xunit
2022-07-27T09:41:53.301Z INFO run.py __init__(42) read_results Searching 'C:\h\w\B668097B\w\B5910958\uploads' for test results files
2022-07-27T09:41:53.301Z INFO run.py packing_test_reporter(30) report_results Packing 32511 test reports to 'C:\h\w\B668097B\w\B5910958\e\__test_report.json'
2022-07-27T09:41:53.848Z INFO run.py packing_test_reporter(33) report_results Packed 12542843 bytes
Did not find dumps, skipping dump docs generation.
So we know the "useful work" finished in less than 20 seconds. Of course I expect some overhead due to downloading and uploading, but it doesn't explain the "Queued":"2022-07-27T07:27:47.745+00:00","Started":"2022-07-27T09:41:09.609+00:00","Finished":"2022-07-27T09:47:58.545+00:00","Delay":"02:13:21.8640000","Duration":"00:06:48.9360000". Where did we spend the other 6 minutes?
Digging into the run_client.py log, I see:
2022-07-27T09:41:38.864Z INFO executor(833) _execute_command running %HELIX_CORRELATION_PAYLOAD%\scripts\0f507811259646bf86fd7d51a4ddb483\execute.cmd in C:\h\w\B668097B\w\B5910958\e max 2700 seconds
2022-07-27T09:41:54.411Z INFO job(38) wait Work item's process exited normally.
2022-07-27T09:42:41.259Z INFO interval(82) event Invoke callback for timer, arguments: None
2022-07-27T09:42:41.259Z INFO servicebusrepository(106) renew_workitem_lock Renewing lock for ServiceBus item MessageId 'ed43292dcc4840348b0966a2025fd012' with url 'https://nethelix.servicebus.windows.net/windows.10.arm64.open/messages/3732380/d8ac9553-6b65-4f19-b839-66ab560b90d4'
2022-07-27T09:42:41.406Z INFO saferequests(87) request_with_retry Response complete with status code '200'
2022-07-27T09:42:41.406Z INFO servicebusrepository(116) renew_workitem_lock Renewed work item lock. Status Code: 200
2022-07-27T09:42:41.406Z INFO interval(54) setup_timer Timer set to 90 seconds
2022-07-27T09:44:11.418Z INFO interval(82) event Invoke callback for timer, arguments: None
2022-07-27T09:44:11.418Z INFO servicebusrepository(106) renew_workitem_lock Renewing lock for ServiceBus item MessageId 'ed43292dcc4840348b0966a2025fd012' with url 'https://nethelix.servicebus.windows.net/windows.10.arm64.open/messages/3732380/d8ac9553-6b65-4f19-b839-66ab560b90d4'
2022-07-27T09:44:11.559Z INFO saferequests(87) request_with_retry Response complete with status code '200'
2022-07-27T09:44:11.559Z INFO servicebusrepository(116) renew_workitem_lock Renewed work item lock. Status Code: 200
2022-07-27T09:44:11.559Z INFO interval(54) setup_timer Timer set to 90 seconds
2022-07-27T09:45:28.108Z INFO interval(70) cancel_abort_after Cancel abort_after for timer
2022-07-27T09:45:28.108Z INFO interval(67) set_abort_after_timer Set the timer's abort_after to 660 seconds from now: abort_after = 2022-07-27 02:56:28.108918
2022-07-27T09:45:28.108Z INFO executor(759) _dump_file_upload No dump files to upload.
2022-07-27T09:45:28.108Z INFO interval(67) set_abort_after_timer Set the timer's abort_after to 180 seconds from now: abort_after = 2022-07-27 02:48:28.108918
2022-07-27T09:45:28.108Z INFO executor(959) _upload_files Allowing maximum 0.0 seconds to upload 0 bytes of files.
2022-07-27T09:45:28.108Z INFO interval(67) set_abort_after_timer Set the timer's abort_after to 60.0 seconds from now: abort_after = 2022-07-27 02:46:28.108918
2022-07-27T09:45:28.108Z INFO executor(877) _execute_command Finished _execute_command, exit code: 0
2022-07-27T09:45:28.108Z INFO executor(920) _upload_single_file Uploading the file console.6425316b.log
So we see that the workitem exited at 09:41:54 but it looks like we spent three and a half minutes until 09:45:28 doing nothing?
Then we spend another 2 minutes publishing the results:
2022-07-27T09:45:53.168Z INFO azure_devops_result_publisher(153) _process uploaded 997 results
2022-07-27T09:46:11.535Z INFO azure_devops_result_publisher(153) _process uploaded 1962 results
2022-07-27T09:46:38.378Z INFO azure_devops_result_publisher(153) _process uploaded 2959 results
2022-07-27T09:47:05.906Z INFO azure_devops_result_publisher(153) _process uploaded 3957 results
2022-07-27T09:47:12.438Z INFO interval(82) event Invoke callback for timer, arguments: None
2022-07-27T09:47:12.516Z INFO servicebusrepository(106) renew_workitem_lock Renewing lock for ServiceBus item MessageId 'ed43292dcc4840348b0966a2025fd012' with url 'https://nethelix.servicebus.windows.net/windows.10.arm64.open/messages/3732380/d8ac9553-6b65-4f19-b839-66ab560b90d4'
2022-07-27T09:47:13.156Z INFO saferequests(87) request_with_retry Response complete with status code '200'
2022-07-27T09:47:13.344Z INFO servicebusrepository(116) renew_workitem_lock Renewed work item lock. Status Code: 200
2022-07-27T09:47:13.438Z INFO interval(54) setup_timer Timer set to 90 seconds
2022-07-27T09:47:27.007Z INFO azure_devops_result_publisher(153) _process uploaded 4860 results
2022-07-27T09:47:44.195Z INFO azure_devops_result_publisher(153) _process uploaded 5600 results
It feels like we should be able to do something to make this more efficient.
I was looking at logs from job https://helix.dot.net/api/jobs/fb0d553f-40cc-48b8-8ff2-5e7ca6762430/workitems?api-version=2019-06-17 that ran on the ARM64 queue.
I found it surprising that each of these workitems took minutes to finish, even though I know most of these tests complete in seconds.
We're losing machine time somewhere I couldn't pinpoint where from the logs:
The console log for the System.Collections.Tests workitem has this:
So we know the "useful work" finished in less than 20 seconds. Of course I expect some overhead due to downloading and uploading, but it doesn't explain the
"Queued":"2022-07-27T07:27:47.745+00:00","Started":"2022-07-27T09:41:09.609+00:00","Finished":"2022-07-27T09:47:58.545+00:00","Delay":"02:13:21.8640000","Duration":"00:06:48.9360000". Where did we spend the other 6 minutes?Digging into the
run_client.pylog, I see:So we see that the workitem exited at 09:41:54 but it looks like we spent three and a half minutes until 09:45:28 doing nothing?
Then we spend another 2 minutes publishing the results:
It feels like we should be able to do something to make this more efficient.