In #80129, several of the `Vector2/3/4` and `Vector<T>` APIs that read from or write to a span were changed from intrinsics to managed methods.
For the most part this is the right change. However, there are some notable performance differences due to the IR that gets created.
Consider the following minimal, self-contained example:

    private static (int, int) Load(int[] array, int index)
    {
        if ((index < 0) || ((array.Length - index) < 2))
        {
            throw new ArgumentOutOfRangeException();
        }
        return (array[index + 0], array[index + 1]);
    }

This produces two notable trees (the same is true if a throw helper is used):

    STMT00000 ( 0x000[E-] ... ??? )
    [000003] -----------  *  JTRUE     void
    [000002] -----------  \--*  LT        int
    [000000] -----------     +--*  LCL_VAR   int    V01 arg1
    [000001] -----------     \--*  CNS_INT   int    0

and

    STMT00004 ( 0x004[E-] ... ??? )
    [000018] ---X-------  *  JTRUE     void
    [000017] ---X-------  \--*  GE        int
    [000015] ---X-------     +--*  SUB       int
    [000013] ---X-------     |  +--*  ARR_LENGTH int
    [000012] -----------     |  |  \--*  LCL_VAR   ref    V00 arg0
    [000014] -----------     |  \--*  LCL_VAR   int    V01 arg1
    [000016] -----------     \--*  CNS_INT   int    2

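The parenthetical about throw helpers can be made concrete. The sketch below is a hypothetical variant of `Load` in which the `throw` is moved into a separate method (`ThrowArgumentOutOfRange` and `LoadWithHelper` are illustrative names, not runtime APIs). The range check itself is unchanged, so it would be expected to import as the same pair of compare-and-branch trees rather than as `BOUNDS_CHECK` nodes. It is written as top-level statements so it can be pasted into a console project's `Program.cs`.

```csharp
using System;

// Hypothetical throw helper (illustrative name): keeps the throw out of the caller's body.
static void ThrowArgumentOutOfRange() => throw new ArgumentOutOfRangeException();

// Same range check as Load above; only the throw site differs.
static (int, int) LoadWithHelper(int[] array, int index)
{
    if ((index < 0) || ((array.Length - index) < 2))
    {
        ThrowArgumentOutOfRange();
    }
    return (array[index + 0], array[index + 1]);
}

// Reads elements 1 and 2 of a three-element array.
Console.WriteLine(LoadWithHelper(new[] { 10, 20, 30 }, 1)); // prints "(20, 30)"
```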
These trees are significantly different from what the intrinsic handling produced, which directly created `BOUNDS_CHECK` nodes:

    [000067] ---X-------  +--*  COMMA     ref
    [000061] ---X-------  |  +--*  BOUNDS_CHECK_ArgRng void
    [000055] -----------  |  |  +--*  LCL_VAR   int    V11 loc5
    [000060] ---X-------  |  |  \--*  ARR_LENGTH int
    [000054] -----------  |  |     \--*  LCL_VAR   ref    V08 loc2
    [000066] ---X-------  |  \--*  COMMA     ref
    [000065] ---X-------  |     +--*  BOUNDS_CHECK_ArgRng void
    [000063] -----------  |     |  +--*  ADD       int
    [000062] -----------  |     |  |  +--*  LCL_VAR   int    V11 loc5
    [000056] -----------  |     |  |  \--*  CNS_INT   int    3
    [000064] ---X-------  |     |  \--*  ARR_LENGTH int
    [000058] -----------  |     |     \--*  LCL_VAR   ref    V08 loc2
    [000057] -----------  |     \--*  LCL_VAR   ref    V08 loc2
    [000059] -----------  \--*  LCL_VAR   int    V11 loc5

Because these aren't `BOUNDS_CHECK` nodes, not only is JIT throughput worse, but the optimizations that kick in are also less effective, which results in worse codegen overall.
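For comparison, here is a rough C# reading of what the two `BOUNDS_CHECK_ArgRng` nodes in the intrinsic tree enforce. This is a sketch under assumptions: each node is read as an unsigned `index >= length` test that throws `ArgumentOutOfRangeException`, the `+ 3` constant is taken to mean a four-element read, and `CheckIndex` and `Load4` are hypothetical names. It illustrates the semantics only; it is not a claim that writing the check this way in managed code yields `BOUNDS_CHECK` nodes.

```csharp
using System;

// Hypothetical helper: what a single BOUNDS_CHECK(index, length) node is assumed to enforce.
static void CheckIndex(int index, int length)
{
    // One unsigned compare rejects both negative indices and indices >= length.
    if (unchecked((uint)index >= (uint)length))
    {
        throw new ArgumentOutOfRangeException(nameof(index));
    }
}

// Hypothetical four-element read guarded like the intrinsic tree:
// first element in range, then last element (index + 3) in range.
static (int, int, int, int) Load4(int[] array, int index)
{
    CheckIndex(index, array.Length);
    CheckIndex(index + 3, array.Length);
    return (array[index], array[index + 1], array[index + 2], array[index + 3]);
}

Console.WriteLine(Load4(new[] { 1, 2, 3, 4, 5 }, 1)); // prints "(2, 3, 4, 5)"
```

Note the contrast with `Load`: there the range is validated with a signed compare against zero plus a subtraction from `array.Length`, which is the shape that shows up as the `LT` and `SUB`/`GE` trees.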