[PowerPC] Fix i128 vcmpequb optimization for loads with range metadata and small constants#196801
Conversation
…a and small constants The combine introduced in 55aff64 lowers scalar i128 compares into vector compares by reissuing the original loads as v16i8 loads. However, the combine was reusing the original MachineMemOperand without modification. If the original i128 load carries !range metadata, the MMO encodes that range using i128 values. Reusing this MMO for a v16i8 load is incorrect as range metadata is only valid for integer scalar types and its bitwidth must match the memory VT. This patch fixes this by creating a new MachineMemOperand for the vector vector load. Additionally, we restrict the combine for constant operands to avoid cases that are better handled by scalar lowering. Small constants (fit within 16 bits) are excluded to prevent generating suboptimal vector compares.
|
@llvm/pr-subscribers-backend-powerpc Author: Amy Kwan (amy-kwan) ChangesThe combine introduced in 55aff64 lowers scalar i128 compares into vector compares by reissuing the original loads as v16i8 loads. However, the combine was reusing the original MachineMemOperand without modification. If the original i128 load carries !range metadata, the MMO encodes that range using i128 values. Reusing this MMO for a v16i8 load is incorrect as range metadata is only valid for integer scalar types and its bitwidth must match the memory VT. This patch fixes this by creating a new MachineMemOperand for the vector vector load. Additionally, we restrict the combine for constant operands to avoid cases that are better handled by scalar lowering. Small constants (fit within 16 bits) are excluded to prevent generating suboptimal vector compares. Full diff: https://github.com/llvm/llvm-project/pull/196801.diff 2 Files Affected:
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index e959100d713dd..a5fc479292717 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -15797,8 +15797,14 @@ static bool canConvertToVcmpequb(SDValue &LHS, SDValue &RHS) {
if (Operand.getValueType() != MVT::i128)
return false;
- if (Operand.getOpcode() == ISD::Constant)
- return true;
+ if (Operand.getOpcode() == ISD::Constant) {
+ auto *C = cast<ConstantSDNode>(Operand);
+ const APInt &Val = C->getAPIntValue();
+ if (Val.ult(1ULL << 16))
+ return false;
+ else
+ return true;
+ }
auto *LoadNode = dyn_cast<LoadSDNode>(Operand);
if (!LoadNode)
@@ -15849,10 +15855,19 @@ SDValue convertTwoLoadsAndCmpToVCMPEQUB(SelectionDAG &DAG, SDNode *N,
assert(Operand.getOpcode() == ISD::LOAD && "Must be LoadSDNode here.");
auto *LoadNode = cast<LoadSDNode>(Operand);
- SDValue NewLoad =
- DAG.getLoad(MVT::v16i8, DL, LoadNode->getChain(),
- LoadNode->getBasePtr(), LoadNode->getMemOperand());
- DAG.ReplaceAllUsesOfValueWith(Operand.getValue(1), NewLoad.getValue(1));
+ // Create a new MachineMemOperand without range metadata.
+ // Range metadata is only valid for integer scalar types, not vectors.
+ // The original i128 load may have range metadata, but when we convert
+ // to v16i8, that metadata is no longer semantically valid.
+ MachineMemOperand *MMO = LoadNode->getMemOperand();
+ MachineFunction &MF = DAG.getMachineFunction();
+ MachineMemOperand *NewMMO = MF.getMachineMemOperand(
+ MMO->getPointerInfo(), MMO->getFlags(), MMO->getSize(),
+ MMO->getAlign(), MMO->getAAInfo(), nullptr, MMO->getSyncScopeID(),
+ MMO->getSuccessOrdering(), MMO->getFailureOrdering());
+ SDValue NewLoad = DAG.getLoad(MVT::v16i8, DL, LoadNode->getChain(),
+ LoadNode->getBasePtr(), NewMMO);
+ DAG.ReplaceAllUsesOfValueWith(SDValue(LoadNode, 1), NewLoad.getValue(1));
return NewLoad;
};
diff --git a/llvm/test/CodeGen/PowerPC/ppc-i128-cmp.ll b/llvm/test/CodeGen/PowerPC/ppc-i128-cmp.ll
new file mode 100644
index 0000000000000..29b4c076bcadf
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/ppc-i128-cmp.ll
@@ -0,0 +1,210 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mcpu=pwr8 -ppc-asm-full-reg-names -mtriple=powerpc64-ibm-aix < %s | \
+; RUN: FileCheck %s --check-prefix=CHECK-AIX
+; RUN: llc -mcpu=pwr8 -ppc-asm-full-reg-names -mtriple=powerpc64le-unknown-linux-gnu < %s | \
+; RUN: FileCheck %s --check-prefix=CHECK-LINUX
+
+define i1 @test1() {
+; CHECK-AIX-LABEL: test1:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: ld r3, 0(0)
+; CHECK-AIX-NEXT: ld r4, 8(0)
+; CHECK-AIX-NEXT: or r3, r4, r3
+; CHECK-AIX-NEXT: cntlzd r3, r3
+; CHECK-AIX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test1:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: ld r3, 8(0)
+; CHECK-LINUX-NEXT: ld r4, 0(0)
+; CHECK-LINUX-NEXT: or r3, r4, r3
+; CHECK-LINUX-NEXT: cntlzd r3, r3
+; CHECK-LINUX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16
+ %icmp = icmp eq i128 %load, 0
+ ret i1 %icmp
+}
+
+define i1 @test2() {
+; CHECK-AIX-LABEL: test2:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: ld r4, 8(0)
+; CHECK-AIX-NEXT: ld r3, 0(0)
+; CHECK-AIX-NEXT: xori r4, r4, 10
+; CHECK-AIX-NEXT: or r3, r4, r3
+; CHECK-AIX-NEXT: cntlzd r3, r3
+; CHECK-AIX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test2:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: ld r4, 0(0)
+; CHECK-LINUX-NEXT: ld r3, 8(0)
+; CHECK-LINUX-NEXT: xori r4, r4, 10
+; CHECK-LINUX-NEXT: or r3, r4, r3
+; CHECK-LINUX-NEXT: cntlzd r3, r3
+; CHECK-LINUX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16
+ %icmp = icmp eq i128 %load, 10
+ ret i1 %icmp
+}
+
+define i1 @test3() {
+; CHECK-AIX-LABEL: test3:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: ld r4, 8(0)
+; CHECK-AIX-NEXT: ld r3, 0(0)
+; CHECK-AIX-NEXT: xori r4, r4, 65535
+; CHECK-AIX-NEXT: or r3, r4, r3
+; CHECK-AIX-NEXT: cntlzd r3, r3
+; CHECK-AIX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test3:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: ld r4, 0(0)
+; CHECK-LINUX-NEXT: ld r3, 8(0)
+; CHECK-LINUX-NEXT: xori r4, r4, 65535
+; CHECK-LINUX-NEXT: or r3, r4, r3
+; CHECK-LINUX-NEXT: cntlzd r3, r3
+; CHECK-LINUX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16
+ %icmp = icmp eq i128 %load, 65535
+ ret i1 %icmp
+}
+
+define i1 @test4() {
+; CHECK-AIX-LABEL: test4:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: li r3, 0
+; CHECK-AIX-NEXT: lxvw4x vs34, 0, r3
+; CHECK-AIX-NEXT: ld r3, L..C0(r2) # %const.0
+; CHECK-AIX-NEXT: lxvd2x vs35, 0, r3
+; CHECK-AIX-NEXT: vcmpequb. v2, v2, v3
+; CHECK-AIX-NEXT: mfocrf r3, 2
+; CHECK-AIX-NEXT: rlwinm r3, r3, 25, 31, 31
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test4:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: li r3, 0
+; CHECK-LINUX-NEXT: lxvd2x vs34, 0, r3
+; CHECK-LINUX-NEXT: addis r3, r2, .LCPI3_0@toc@ha
+; CHECK-LINUX-NEXT: addi r3, r3, .LCPI3_0@toc@l
+; CHECK-LINUX-NEXT: lxvd2x vs35, 0, r3
+; CHECK-LINUX-NEXT: vcmpequb. v2, v2, v3
+; CHECK-LINUX-NEXT: mfocrf r3, 2
+; CHECK-LINUX-NEXT: rlwinm r3, r3, 25, 31, 31
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16
+ %icmp = icmp eq i128 %load, 65536
+ ret i1 %icmp
+}
+
+; Test using the !range metadata
+define i1 @test5() {
+; CHECK-AIX-LABEL: test5:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: ld r3, 0(0)
+; CHECK-AIX-NEXT: ld r4, 8(0)
+; CHECK-AIX-NEXT: or r3, r4, r3
+; CHECK-AIX-NEXT: cntlzd r3, r3
+; CHECK-AIX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test5:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: ld r3, 8(0)
+; CHECK-LINUX-NEXT: ld r4, 0(0)
+; CHECK-LINUX-NEXT: or r3, r4, r3
+; CHECK-LINUX-NEXT: cntlzd r3, r3
+; CHECK-LINUX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16, !range !0
+ %icmp = icmp eq i128 %load, 0
+ ret i1 %icmp
+}
+
+define i1 @test6() {
+; CHECK-AIX-LABEL: test6:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: ld r4, 8(0)
+; CHECK-AIX-NEXT: ld r3, 0(0)
+; CHECK-AIX-NEXT: xori r4, r4, 65535
+; CHECK-AIX-NEXT: or r3, r4, r3
+; CHECK-AIX-NEXT: cntlzd r3, r3
+; CHECK-AIX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test6:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: ld r4, 0(0)
+; CHECK-LINUX-NEXT: ld r3, 8(0)
+; CHECK-LINUX-NEXT: xori r4, r4, 65535
+; CHECK-LINUX-NEXT: or r3, r4, r3
+; CHECK-LINUX-NEXT: cntlzd r3, r3
+; CHECK-LINUX-NEXT: rldicl r3, r3, 58, 63
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16, !range !1
+ %icmp = icmp eq i128 %load, 65535
+ ret i1 %icmp
+}
+
+define i1 @test7() {
+; CHECK-AIX-LABEL: test7:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: li r3, 0
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test7:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: li r3, 0
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16, !range !1
+ %icmp = icmp eq i128 %load, 65536
+ ret i1 %icmp
+}
+
+define i1 @test8() {
+; CHECK-AIX-LABEL: test8:
+; CHECK-AIX: # %bb.0: # %bb
+; CHECK-AIX-NEXT: li r3, 0
+; CHECK-AIX-NEXT: lxvw4x vs34, 0, r3
+; CHECK-AIX-NEXT: ld r3, L..C1(r2) # %const.0
+; CHECK-AIX-NEXT: lxvd2x vs35, 0, r3
+; CHECK-AIX-NEXT: vcmpequb. v2, v2, v3
+; CHECK-AIX-NEXT: mfocrf r3, 2
+; CHECK-AIX-NEXT: rlwinm r3, r3, 25, 31, 31
+; CHECK-AIX-NEXT: blr
+;
+; CHECK-LINUX-LABEL: test8:
+; CHECK-LINUX: # %bb.0: # %bb
+; CHECK-LINUX-NEXT: li r3, 0
+; CHECK-LINUX-NEXT: lxvd2x vs34, 0, r3
+; CHECK-LINUX-NEXT: addis r3, r2, .LCPI7_0@toc@ha
+; CHECK-LINUX-NEXT: addi r3, r3, .LCPI7_0@toc@l
+; CHECK-LINUX-NEXT: lxvd2x vs35, 0, r3
+; CHECK-LINUX-NEXT: vcmpequb. v2, v2, v3
+; CHECK-LINUX-NEXT: mfocrf r3, 2
+; CHECK-LINUX-NEXT: rlwinm r3, r3, 25, 31, 31
+; CHECK-LINUX-NEXT: blr
+bb:
+ %load = load i128, ptr null, align 16, !range !2
+ %icmp = icmp eq i128 %load, 65536
+ ret i1 %icmp
+}
+
+!0 = !{i128 0, i128 2}
+!1 = !{i128 0, i128 65536}
+!2 = !{i128 0, i128 65537}
|
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
| if (Operand.getOpcode() == ISD::Constant) { | ||
| auto *C = cast<ConstantSDNode>(Operand); | ||
| const APInt &Val = C->getAPIntValue(); | ||
| if (Val.ult(1ULL << 16)) |
There was a problem hiding this comment.
I suggest that add a comment here " When comparing an i128 value loaded from memory against a constant, the comparison can be lowered to xori or or if the constant is less than 2¹⁶, since xori's immediate field is 16 bits wide."
and I guess comparing with constant zero when target is 32bit. convertTwoLoadsAndCmpToVCMPEQUB will has less instructions.
if (Val.ult(1ULL << 16))
-->
if (Val.ult(1ULL << 16) && DAG.getTargetLoweringInfo().getPointerTy(DAG.getDataLayout()).getSizeInBits() == 64 )
diggerlin
left a comment
There was a problem hiding this comment.
LGTM, but please address my comment first.
| @@ -3,6 +3,8 @@ | |||
| ; RUN: FileCheck %s --check-prefix=CHECK-AIX | |||
There was a problem hiding this comment.
change CHECK-AIX --> CHECK-AIX64
|
Cherry-pick this into the release branch to resolve an assertion involving range metadata. /cherry-pick 1907b58 |
|
/pull-request #198177 |
…a and small constants (llvm#196801) The combine introduced in 55aff64 lowers scalar i128 compares into vector compares by reissuing the original loads as v16i8 loads. However, the combine was reusing the original MachineMemOperand without modification. If the original i128 load carries !range metadata, the MMO encodes that range using i128 values. Reusing this MMO for a v16i8 load is incorrect as range metadata is only valid for integer scalar types and its bitwidth must match the memory VT. This patch fixes this by creating a new MachineMemOperand for the vector vector load. Additionally, we restrict the combine for constant operands to avoid cases that are better handled by scalar lowering. Small constants (fit within 16 bits) are excluded to prevent generating suboptimal vector compares.
…a and small constants (llvm#196801) The combine introduced in 55aff64 lowers scalar i128 compares into vector compares by reissuing the original loads as v16i8 loads. However, the combine was reusing the original MachineMemOperand without modification. If the original i128 load carries !range metadata, the MMO encodes that range using i128 values. Reusing this MMO for a v16i8 load is incorrect as range metadata is only valid for integer scalar types and its bitwidth must match the memory VT. This patch fixes this by creating a new MachineMemOperand for the vector vector load. Additionally, we restrict the combine for constant operands to avoid cases that are better handled by scalar lowering. Small constants (fit within 16 bits) are excluded to prevent generating suboptimal vector compares. (cherry picked from commit 1907b58)
…a and small constants (llvm#196801) The combine introduced in 55aff64 lowers scalar i128 compares into vector compares by reissuing the original loads as v16i8 loads. However, the combine was reusing the original MachineMemOperand without modification. If the original i128 load carries !range metadata, the MMO encodes that range using i128 values. Reusing this MMO for a v16i8 load is incorrect as range metadata is only valid for integer scalar types and its bitwidth must match the memory VT. This patch fixes this by creating a new MachineMemOperand for the vector vector load. Additionally, we restrict the combine for constant operands to avoid cases that are better handled by scalar lowering. Small constants (fit within 16 bits) are excluded to prevent generating suboptimal vector compares. (cherry picked from commit 1907b58)
The combine introduced in 55aff64 lowers scalar i128 compares into vector compares by reissuing the original loads as v16i8 loads. However, the combine was reusing the original MachineMemOperand without modification.
If the original i128 load carries !range metadata, the MMO encodes that range using i128 values. Reusing this MMO for a v16i8 load is incorrect as range metadata is only valid for integer scalar types and its bitwidth must match the memory VT.
This patch fixes this by creating a new MachineMemOperand for the vector vector load. Additionally, we restrict the combine for constant operands to avoid cases that are better handled by scalar lowering. Small constants (fit within 16 bits) are excluded to prevent generating suboptimal vector compares.