i64x2.min_s and i64x2.max_s instructions #417
Maratyszcza wants to merge 1 commit into WebAssembly:master
I don't think this meets the bar for inclusion. The codegen is not great, and half of the use cases are SIMD libraries which expose such instructions (they don't use them).
It is expected that most uses of 64-bit integer operations are through either high-level wrappers or auto-vectorization: there are usually more efficient ways to do computations within narrower data types, but they are ISA-specific (e.g. on ARM NEON we may use saturated 32-bit arithmetic, but it is not portable to x86). Thus it is mainly code that trades some performance for portability (through high-level wrapper libraries or through auto-vectorization) that uses 64-bit arithmetic. IMO lowering on recentish systems isn't bad: 4 instructions on SSE4.2, 3 instructions on ARMv7 NEON, 2 instructions on ARM64 and AVX. Without specialized
Adding a preliminary vote for the inclusion of i64x2 signed min/max operations to the SIMD proposal below. Please vote with
- 👍 For including i64x2 signed min/max operations
The community group unanimously decided against including these instructions in the 1/29/21 meeting (#429).
Introduction
This is a proposal to add 64-bit variants of the existing `min_s` and `max_s` instructions. Only x86 processors with AVX512 natively support these instructions, but ARMv7 NEON, ARM64 and x86 with SSE4.2 or AVX can efficiently emulate them with 2-4 instructions.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
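As a reference point for reading the lowerings in this section, the lane-wise behaviour they are all meant to reproduce can be sketched in scalar C. This is a minimal sketch, not text from the proposal: the `i64x2` struct and the function names `i64x2_min_s` / `i64x2_max_s` are hypothetical stand-ins for a 128-bit vector of two signed 64-bit lanes. Each lane is written as "compare `a > b`, then select", which is the same shape as the compare-and-blend sequences listed for AVX, SSE4.2 and ARM64.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit vector with two signed 64-bit lanes. */
typedef struct { int64_t lane[2]; } i64x2;

/* Assumed semantics of i64x2.min_s: per-lane signed minimum. */
static i64x2 i64x2_min_s(i64x2 a, i64x2 b) {
    i64x2 y;
    for (int i = 0; i < 2; i++) {
        /* a > b ? b : a, i.e. "compare greater-than, then blend". */
        y.lane[i] = (a.lane[i] > b.lane[i]) ? b.lane[i] : a.lane[i];
    }
    return y;
}

/* Assumed semantics of i64x2.max_s: per-lane signed maximum. */
static i64x2 i64x2_max_s(i64x2 a, i64x2 b) {
    i64x2 y;
    for (int i = 0; i < 2; i++) {
        /* Same comparison, with the two blend inputs swapped. */
        y.lane[i] = (a.lane[i] > b.lane[i]) ? a.lane[i] : b.lane[i];
    }
    return y;
}
```

Note that min and max share the comparison and differ only in which operand the mask selects, which is why each pair of lowerings differs only in the operand order of the final blend or select.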
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
- `y = i64x2.min_s(a, b)` is lowered to `VPMINSQ xmm_y, xmm_a, xmm_b`
- `y = i64x2.max_s(a, b)` is lowered to `VPMAXSQ xmm_y, xmm_a, xmm_b`

x86/x86-64 processors with AVX instruction set
- `y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y`
- `y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `VPCMPGTQ xmm_y, xmm_a, xmm_b`
  - `VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y`

x86/x86-64 processors with SSE4.2 instruction set
- `y = i64x2.min_s(a, b)` (`y` is not `b` and `a`/`b`/`y` are not in `xmm0`) is lowered to:
  - `MOVDQA xmm0, xmm_a`
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTQ xmm0, xmm_b`
  - `PBLENDVB xmm_y, xmm_b`
- `y = i64x2.max_s(a, b)` (`y` is not `a` and `a`/`b`/`y` are not in `xmm0`) is lowered to:
  - `MOVDQA xmm0, xmm_a`
  - `MOVDQA xmm_y, xmm_b`
  - `PCMPGTQ xmm0, xmm_b`
  - `PBLENDVB xmm_y, xmm_a`

x86/x86-64 processors with SSE4.1 instruction set
Based on this answer by user aqrit on Stack Overflow
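SSE4.1 has no 64-bit signed compare, so the sequences in this section build the `a > b` mask from a 64-bit subtraction and 32-bit compares. The following scalar C model of one 64-bit lane is a sketch of that idea as read off the instruction sequence, not code from the linked answer; the function name `gt_mask_s64` is hypothetical. Only the high 32-bit half of each intermediate result is computed, since `PSHUFD ..., 0xF5` discards the low halves.

```c
#include <stdint.h>

/* Returns all-ones if a > b as signed 64-bit integers, else zero.
 * Scalar model of one lane of the PSUBQ / PCMPEQD / PAND / PCMPGTD / POR /
 * PSHUFD sequence. Assumes the usual two's-complement behaviour when
 * converting uint32_t to int32_t. */
static uint64_t gt_mask_s64(int64_t a, int64_t b) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t diff = ub - ua;                         /* PSUBQ: b - a, wrapping */

    uint32_t a_hi = (uint32_t)(ua >> 32);
    uint32_t b_hi = (uint32_t)(ub >> 32);
    uint32_t diff_hi = (uint32_t)(diff >> 32);

    /* PCMPEQD on the high halves: all-ones when they are equal. */
    uint32_t eq_hi = (a_hi == b_hi) ? UINT32_MAX : 0;
    /* PCMPGTD on the high halves: signed 32-bit a_hi > b_hi. */
    uint32_t gt_hi = ((int32_t)a_hi > (int32_t)b_hi) ? UINT32_MAX : 0;

    /* PAND + POR. When the high halves are equal, diff_hi is 0 or
     * 0xFFFFFFFF depending on the borrow out of the low halves, so it
     * is all-ones exactly when a > b. */
    uint32_t m = (diff_hi & eq_hi) | gt_hi;

    /* PSHUFD 0xF5: copy the high half of the lane over the low half. */
    return ((uint64_t)m << 32) | m;
}
```

The resulting mask is what `PBLENDVB` consumes (through the implicit `xmm0` operand) to pick `b` or `a` per lane.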
- `y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b` and `a`/`b`/`y` are not in `xmm0`) is lowered to:
  - `MOVDQA xmm0, xmm_b`
  - `MOVDQA xmm_y, xmm_a`
  - `PSUBQ xmm0, xmm_a`
  - `PCMPEQD xmm_y, xmm_b`
  - `PAND xmm0, xmm_y`
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTD xmm_y, xmm_b`
  - `POR xmm0, xmm_y`
  - `MOVDQA xmm_y, xmm_a`
  - `PSHUFD xmm0, xmm0, 0xF5`
  - `PBLENDVB xmm_y, xmm_b`
- `y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b` and `a`/`b`/`y` are not in `xmm0`) is lowered to:
  - `MOVDQA xmm0, xmm_b`
  - `MOVDQA xmm_y, xmm_a`
  - `PSUBQ xmm0, xmm_a`
  - `PCMPEQD xmm_y, xmm_b`
  - `PAND xmm0, xmm_y`
  - `MOVDQA xmm_y, xmm_a`
  - `PCMPGTD xmm_y, xmm_b`
  - `POR xmm0, xmm_y`
  - `MOVDQA xmm_y, xmm_b`
  - `PSHUFD xmm0, xmm0, 0xF5`
  - `PBLENDVB xmm_y, xmm_a`

x86/x86-64 processors with SSE2 instruction set
Based on this answer by user aqrit on Stack Overflow
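SSE2 also lacks `PBLENDVB`, so the sequences in this section finish with a bitwise select built from `PAND`, `PANDN` and `POR`. A scalar C sketch of that select for one 64-bit lane follows; the helper names are hypothetical, and the mask is computed here with a plain comparison simply to keep the example short.

```c
#include <stdint.h>

/* Bitwise select: takes bits from if_set where mask is 1 and from
 * if_clear where mask is 0. Models PAND / PANDN / POR. */
static uint64_t select_u64(uint64_t mask, uint64_t if_set, uint64_t if_clear) {
    return (if_set & mask) | (if_clear & ~mask);
}

/* Signed 64-bit minimum via mask-and-select (one lane). */
static int64_t min_s64_via_select(int64_t a, int64_t b) {
    uint64_t mask = (a > b) ? UINT64_MAX : 0;   /* all-ones when a > b */
    return (int64_t)select_u64(mask, (uint64_t)b, (uint64_t)a);
}

/* Signed 64-bit maximum: same mask, select inputs swapped. */
static int64_t max_s64_via_select(int64_t a, int64_t b) {
    uint64_t mask = (a > b) ? UINT64_MAX : 0;
    return (int64_t)select_u64(mask, (uint64_t)a, (uint64_t)b);
}
```

Because the mask is all-ones or all-zeros in each lane, the select returns one operand unchanged, so it behaves the same as a blend.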
- `y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_b`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PAND xmm_tmp, xmm_y`
  - `PANDN xmm_y, xmm_a`
  - `POR xmm_y, xmm_tmp`
- `y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `MOVDQA xmm_y, xmm_b`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSUBQ xmm_y, xmm_a`
  - `PCMPEQD xmm_tmp, xmm_b`
  - `PAND xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PCMPGTD xmm_tmp, xmm_b`
  - `POR xmm_y, xmm_tmp`
  - `MOVDQA xmm_tmp, xmm_a`
  - `PSHUFD xmm_y, xmm_y, 0xF5`
  - `PAND xmm_tmp, xmm_y`
  - `PANDN xmm_y, xmm_b`
  - `POR xmm_y, xmm_tmp`

ARM64 processors
- `y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `CMGT Vy.2D, Va.2D, Vb.2D`
  - `BSL Vy.16B, Vb.16B, Va.16B`
- `y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `CMGT Vy.2D, Va.2D, Vb.2D`
  - `BSL Vy.16B, Va.16B, Vb.16B`

ARMv7 processors with NEON instruction set
Based on this answer by user aqrit on Stack Overflow
- `y = i64x2.min_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
  - `VBSL Qy, Qb, Qa`
- `y = i64x2.max_s(a, b)` (`y` is not `a` and `y` is not `b`) is lowered to:
  - `VQSUB.S64 Qy, Qb, Qa`
  - `VSHR.S64 Qy, Qy, #63`
  - `VBSL Qy, Qa, Qb`
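The ARMv7 sequences above derive the comparison mask from the sign of a saturating subtraction rather than from a compare instruction. A scalar C sketch of one lane follows, under the assumption that `VQSUB.S64` clamps to the `int64_t` range and that `VSHR.S64 #63` replicates the sign bit across the lane; the function names are hypothetical. Saturation matters here: with a wrapping subtraction, `b - a` can overflow and flip sign, which would yield the wrong mask for operands that are far apart.

```c
#include <stdint.h>

/* Saturating signed subtraction x - y, modelling one lane of VQSUB.S64. */
static int64_t qsub_s64(int64_t x, int64_t y) {
    if (y > 0 && x < INT64_MIN + y) return INT64_MIN;  /* would underflow */
    if (y < 0 && x > INT64_MAX + y) return INT64_MAX;  /* would overflow */
    return x - y;
}

/* All-ones if b < a, else zero: the sign of the saturated b - a,
 * replicated across the lane as VSHR.S64 #63 does. */
static uint64_t lt_mask_s64(int64_t b, int64_t a) {
    return (qsub_s64(b, a) < 0) ? UINT64_MAX : 0;
}

/* Signed minimum of one lane: select b where b < a, else a (VBSL). */
static int64_t min_s64_qsub(int64_t a, int64_t b) {
    uint64_t m = lt_mask_s64(b, a);
    return (int64_t)(((uint64_t)b & m) | ((uint64_t)a & ~m));
}

/* Signed maximum of one lane: same mask, select inputs swapped. */
static int64_t max_s64_qsub(int64_t a, int64_t b) {
    uint64_t m = lt_mask_s64(b, a);
    return (int64_t)(((uint64_t)a & m) | ((uint64_t)b & ~m));
}
```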