Skip to content

Unaccounted lost time in Helix workitem processing #10168

Description

@MichalStrehovsky
  • This issue is blocking
  • This issue is causing unreasonable pain

I was looking at logs from job https://helix.dot.net/api/jobs/fb0d553f-40cc-48b8-8ff2-5e7ca6762430/workitems?api-version=2019-06-17 that ran on the ARM64 queue.

I found it surprising that each of these workitems took minutes to finish, even though I know most of these tests complete in seconds.

We're losing machine time somewhere I couldn't pinpoint where from the logs:

The console log for the System.Collections.Tests workitem has this:

Console log: 'System.Collections.Tests' from job fb0d553f-40cc-48b8-8ff2-5e7ca6762430 workitem 35fcc6c3-ed55-4c7f-9b23-953d667f7502 (windows.10.arm64.open) executed on machine DDARM64S-033

C:\h\w\B668097B\w\B5910958\e>taskkill.exe /f /im corerun.exe 
ERROR: The process "corerun.exe" not found.

C:\h\w\B668097B\w\B5910958\e>call RunTests.cmd --runtime-path C:\h\w\B668097B\p 
----- start Wed 07/27/2022  2:41:39.11 ===============  To repro directly: ===================================================== 
pushd C:\h\w\B668097B\w\B5910958\e\
System.Collections.Tests.exe -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml 
popd
===========================================================================================================

C:\h\w\B668097B\w\B5910958\e>System.Collections.Tests.exe -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml  

/* Skipping for brevity */

Finished System.Collections.Tests, Version=7.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51

Tests run: 32511, Errors: 0, Failures: 0, Skipped: 109. Time: 5.1977535s
----- end Wed 07/27/2022  2:41:48.11 ----- exit code 0 ----------------------------------------------------------
2022-07-27T09:41:50.817Z	INFO   	run.py	run(48)	main	Beginning reading of test results.
2022-07-27T09:41:50.833Z	INFO   	run.py	__init__(42)	read_results	Searching 'C:\h\w\B668097B\w\B5910958\e' for test results files
2022-07-27T09:41:50.833Z	INFO   	run.py	__init__(48)	read_results	Found results file C:\h\w\B668097B\w\B5910958\e\testResults.xml with format xunit
2022-07-27T09:41:53.301Z	INFO   	run.py	__init__(42)	read_results	Searching 'C:\h\w\B668097B\w\B5910958\uploads' for test results files
2022-07-27T09:41:53.301Z	INFO   	run.py	packing_test_reporter(30)	report_results	Packing 32511 test reports to 'C:\h\w\B668097B\w\B5910958\e\__test_report.json'
2022-07-27T09:41:53.848Z	INFO   	run.py	packing_test_reporter(33)	report_results	Packed 12542843 bytes
Did not find dumps, skipping dump docs generation.

So we know the "useful work" finished in less than 20 seconds. Of course I expect some overhead due to downloading and uploading, but it doesn't explain the "Queued":"2022-07-27T07:27:47.745+00:00","Started":"2022-07-27T09:41:09.609+00:00","Finished":"2022-07-27T09:47:58.545+00:00","Delay":"02:13:21.8640000","Duration":"00:06:48.9360000". Where did we spend the other 6 minutes?

Digging into the run_client.py log, I see:

2022-07-27T09:41:38.864Z	INFO   	executor(833)	_execute_command	running %HELIX_CORRELATION_PAYLOAD%\scripts\0f507811259646bf86fd7d51a4ddb483\execute.cmd in C:\h\w\B668097B\w\B5910958\e max 2700 seconds
2022-07-27T09:41:54.411Z	INFO   	job(38)	wait	Work item's process exited normally.
2022-07-27T09:42:41.259Z	INFO   	interval(82)	event	Invoke callback for timer, arguments: None
2022-07-27T09:42:41.259Z	INFO   	servicebusrepository(106)	renew_workitem_lock	Renewing lock for ServiceBus item MessageId 'ed43292dcc4840348b0966a2025fd012' with url 'https://nethelix.servicebus.windows.net/windows.10.arm64.open/messages/3732380/d8ac9553-6b65-4f19-b839-66ab560b90d4'
2022-07-27T09:42:41.406Z	INFO   	saferequests(87)	request_with_retry	Response complete with status code '200'
2022-07-27T09:42:41.406Z	INFO   	servicebusrepository(116)	renew_workitem_lock	Renewed work item lock. Status Code: 200
2022-07-27T09:42:41.406Z	INFO   	interval(54)	setup_timer	Timer set to 90 seconds
2022-07-27T09:44:11.418Z	INFO   	interval(82)	event	Invoke callback for timer, arguments: None
2022-07-27T09:44:11.418Z	INFO   	servicebusrepository(106)	renew_workitem_lock	Renewing lock for ServiceBus item MessageId 'ed43292dcc4840348b0966a2025fd012' with url 'https://nethelix.servicebus.windows.net/windows.10.arm64.open/messages/3732380/d8ac9553-6b65-4f19-b839-66ab560b90d4'
2022-07-27T09:44:11.559Z	INFO   	saferequests(87)	request_with_retry	Response complete with status code '200'
2022-07-27T09:44:11.559Z	INFO   	servicebusrepository(116)	renew_workitem_lock	Renewed work item lock. Status Code: 200
2022-07-27T09:44:11.559Z	INFO   	interval(54)	setup_timer	Timer set to 90 seconds
2022-07-27T09:45:28.108Z	INFO   	interval(70)	cancel_abort_after	Cancel abort_after for timer
2022-07-27T09:45:28.108Z	INFO   	interval(67)	set_abort_after_timer	Set the timer's abort_after to 660 seconds from now: abort_after = 2022-07-27 02:56:28.108918
2022-07-27T09:45:28.108Z	INFO   	executor(759)	_dump_file_upload	No dump files to upload.
2022-07-27T09:45:28.108Z	INFO   	interval(67)	set_abort_after_timer	Set the timer's abort_after to 180 seconds from now: abort_after = 2022-07-27 02:48:28.108918
2022-07-27T09:45:28.108Z	INFO   	executor(959)	_upload_files	Allowing maximum 0.0 seconds to upload 0 bytes of files.
2022-07-27T09:45:28.108Z	INFO   	interval(67)	set_abort_after_timer	Set the timer's abort_after to 60.0 seconds from now: abort_after = 2022-07-27 02:46:28.108918
2022-07-27T09:45:28.108Z	INFO   	executor(877)	_execute_command	Finished _execute_command, exit code: 0
2022-07-27T09:45:28.108Z	INFO   	executor(920)	_upload_single_file	Uploading the file console.6425316b.log

So we see that the workitem exited at 09:41:54 but it looks like we spent three and a half minutes until 09:45:28 doing nothing?

Then we spend another 2 minutes publishing the results:

2022-07-27T09:45:53.168Z	INFO   	azure_devops_result_publisher(153)	_process	uploaded 997 results
2022-07-27T09:46:11.535Z	INFO   	azure_devops_result_publisher(153)	_process	uploaded 1962 results
2022-07-27T09:46:38.378Z	INFO   	azure_devops_result_publisher(153)	_process	uploaded 2959 results
2022-07-27T09:47:05.906Z	INFO   	azure_devops_result_publisher(153)	_process	uploaded 3957 results
2022-07-27T09:47:12.438Z	INFO   	interval(82)	event	Invoke callback for timer, arguments: None
2022-07-27T09:47:12.516Z	INFO   	servicebusrepository(106)	renew_workitem_lock	Renewing lock for ServiceBus item MessageId 'ed43292dcc4840348b0966a2025fd012' with url 'https://nethelix.servicebus.windows.net/windows.10.arm64.open/messages/3732380/d8ac9553-6b65-4f19-b839-66ab560b90d4'
2022-07-27T09:47:13.156Z	INFO   	saferequests(87)	request_with_retry	Response complete with status code '200'
2022-07-27T09:47:13.344Z	INFO   	servicebusrepository(116)	renew_workitem_lock	Renewed work item lock. Status Code: 200
2022-07-27T09:47:13.438Z	INFO   	interval(54)	setup_timer	Timer set to 90 seconds
2022-07-27T09:47:27.007Z	INFO   	azure_devops_result_publisher(153)	_process	uploaded 4860 results
2022-07-27T09:47:44.195Z	INFO   	azure_devops_result_publisher(153)	_process	uploaded 5600 results

It feels like we should be able to do something to make this more efficient.

Metadata

Metadata

Assignees

Labels

DDFun OperationsWork items, generally linked to other tracking formats, handled by the DDFUN (formerly DDIT) team.Needs TriageA new issue that needs to be associated with an epic.On Premise HardwareUsed by FR to track issues related to our physical hardware

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions