Instances randomly get stopped during live migration triggered by host maintenance #13010

@jgotteswinter

Description

When I enable maintenance mode on a host, random instances get stopped while the host is being evacuated. The majority are live migrated without any issues, but occasionally an instance that should have been live migrated is stopped instead.

The management server logs the following:

2026-04-13 10:26:54,986 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-1:[ctx-7abbe53d, work-3314]) (logid:5ce65c99) Migration attempt: for VM VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}from host Host {"id":18,"name":"XXXch02","type":"Routing","uuid":"dc51-a18d-4f7d-9a2e-7dfbb7a1b908"}. Starting attempt: 1/5 times.
2026-04-13 10:42:32,197 INFO [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Migrating VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"} to Dest[Zone(Id)-Pod(Id)-Cluster(Id)-Host(Id)-Storage(Volume(Id|Type-->Pool(Id))] : Dest[Zone(3)-Pod(3)-Cluster(3)-Host(18)-Storage()]
2026-04-13 10:42:32,349 WARN [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"} to Host {"id":18,"name":"XXXch02","type":"Routing","uuid":"dc51-a18d-4f7d-9a2e-7dfbb7a1b908"} due to [Resource [Host:18] is unreachable: Host 18: Operation timed out]
com.cloud.exception.AgentUnavailableException: Resource [Host:18] is unreachable: Host 18: Operation timed out
2026-04-13 10:43:27,247 INFO [c.c.r.ResourceManagerImpl] (AgentMonitor-1:[ctx-6e6b2b3f]) (logid:afd387b5) Attempting maintenance for Host {"id":21,"name":"XXXch03","type":"Routing","uuid":"eacf-b3e7-4aa9-b4ae-ff5a41862c06"} found pending migration for VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Stopping","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}.
2026-04-13 10:43:40,248 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}
2026-04-13 10:43:40,248 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693, ctx-ff1f267b]) (logid:279e8d1b) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}
2026-04-13 10:43:40,248 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-21:[ctx-1e4a3543, job-742712/job-743693]) (logid:279e8d1b) Unable to complete AsyncJob {"accountId":1,"cmd":"com.cloud.vm.VmWorkMigrateAway","cmdInfo":"rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAATQnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAFQ","cmdVersion":0,"completeMsid":null,"created":"Mon Apr 13 10:42:31 UTC 2026","id":743693,"initMsid":90520733699643,"instanceId":null,"instanceType":null,"lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":1,"uuid":"1401-8cf9-4276-ab57-c6a844371dd2"}, job origin: 742712
com.cloud.utils.exception.CloudRuntimeException: Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running","type":"User","uuid":"cf19-00b6-465e-98f1-c63b4860498d"}
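
For anyone trying to reproduce this, a small script can help spot where the instance state changes in a log like the one above. The following is only a sketch, not part of CloudStack: the function name `vm_timeline`, the `ENTRY_START` pattern, and the `SAMPLE` text (two entries trimmed from the log above) are illustrative assumptions. It assumes each entry starts with a `YYYY-MM-DD HH:MM:SS,mmm LEVEL [logger]` prefix, so it also works when several entries have been pasted onto one line.

```python
import re

# Assumed entry prefix: "<date> <time>,<millis> <LEVEL> [<logger>]".
ENTRY_START = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (INFO|WARN|ERROR|DEBUG) \[([^\]]+)\]"
)


def vm_timeline(log_text, instance_name):
    """Return (timestamp, level, logger, state) for each entry mentioning the VM.

    `state` is the value of the "state" field that directly follows the
    instanceName in the logged VM JSON, or None if the entry has no such field.
    """
    starts = list(ENTRY_START.finditer(log_text))
    state_re = re.compile(
        r'"instanceName":"%s","state":"(\w+)"' % re.escape(instance_name)
    )
    timeline = []
    for i, match in enumerate(starts):
        # An entry's body runs until the next entry prefix (or end of text).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(log_text)
        body = log_text[match.end():end]
        if instance_name not in body:
            continue
        state = state_re.search(body)
        timeline.append(
            (match.group(1), match.group(2), match.group(3),
             state.group(1) if state else None)
        )
    return timeline


# Two entries trimmed from the log in this issue, joined on one line.
SAMPLE = (
    '2026-04-13 10:42:32,349 WARN [c.c.v.ClusteredVirtualMachineManagerImpl] '
    'Unable to migrate VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Running"} '
    '2026-04-13 10:43:27,247 INFO [c.c.r.ResourceManagerImpl] '
    'found pending migration for VM instance {"id":4930,"instanceName":"i-55-4930-VM","state":"Stopping"}.'
)

if __name__ == "__main__":
    for entry in vm_timeline(SAMPLE, "i-55-4930-VM"):
        print(entry)
```

Run against the full management-server log, this lists every entry for the instance in order, which makes the point where it is first reported as `Stopping` easier to find.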

I would expect the instance to be left alone, up and running on its origin host, and the maintenance-mode request to fail instead.

Versions

ACS 4.22
Ubuntu 24.04
KVM
