Skip to content

[SLP] Enable full non-power-of-2 vectorization by default#196825

Merged
alexey-bataev merged 5 commits into
mainfrom
users/alexey-bataev/spr/slp-enable-full-non-power-of-2-vectorization-by-default
May 17, 2026
Merged

[SLP] Enable full non-power-of-2 vectorization by default#196825
alexey-bataev merged 5 commits into
mainfrom
users/alexey-bataev/spr/slp-enable-full-non-power-of-2-vectorization-by-default

Conversation

@alexey-bataev

Copy link
Copy Markdown
Member

Default slp-vectorize-non-power-of-2 to true and broaden the set of
supported widths beyond NumElts + 1 == bit_ceil(NumElts) to include
small widths (<= 5), widths where NumElts - 1 is also non-power of two
(e.g. 6, 7, 10..15), and any width when the elements being vectorized
are themselves vectors (REVEC). Tweak gathered loads, stores, and
reduction support to the non-power-of-2 vector factors.

Created using spr 1.3.7
@llvmorg-github-actions

llvmorg-github-actions Bot commented May 10, 2026

Copy link
Copy Markdown

@llvm/pr-subscribers-backend-webassembly
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-backend-risc-v

Author: Alexey Bataev (alexey-bataev)

Changes

Default slp-vectorize-non-power-of-2 to true and broaden the set of
supported widths beyond NumElts + 1 == bit_ceil(NumElts) to include
small widths (<= 5), widths where NumElts - 1 is also non-power of two
(e.g. 6, 7, 10..15), and any width when the elements being vectorized
are themselves vectors (REVEC). Tweak gathered loads, stores, and
reduction support to the non-power-of-2 vector factors.


Patch is 149.68 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/196825.diff

37 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+65-14)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll (+31-98)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll (+2-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll (+2-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/scalable-vector.ll (+1-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/trunc-insertion.ll (+9-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s352.ll (+18-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+6-21)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/horizontal-list.ll (+3-41)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reduced-value-repeated-and-vectorized.ll (+4-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/revec.ll (+24-60)
  • (modified) llvm/test/Transforms/SLPVectorizer/SystemZ/minbitwidth-trunc.ll (+5-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/bool-mask.ll (+5-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/buildvector-store-chains.ll (+5-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/compare-reduce.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll (+137-42)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/entries-shuffled-diff-sizes.ll (+7-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/extracts-non-extendable.ll (+1-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/gather-extractelements-different-bbs.ll (+9-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll (+44-56)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll (+140-76)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/multi-use-bitcasted-reduction.ll (+3-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/multi_user.ll (+6-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll (+6-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/parent-node-split-non-schedulable.ll (+5-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduced-val-vectorized-in-transform.ll (+4-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-bool-logic-op-inside.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+34-23)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-same-vals.ll (+3-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/rgb_phi.ll (+14-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll (+4-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-reduction-op.ll (+4-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec3-base.ll (+4-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/bool-logical-op-reduction-with-poison.ll (+9-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/logical-ops-poisonous-repeated.ll (+6-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/partial-register-extract.ll (+11-22)
  • (modified) llvm/test/Transforms/SLPVectorizer/reduced-gathered-vectorized.ll (+2-8)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index f1a6eb2d7e8af..f1d7316a869bf 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -229,7 +229,7 @@ static cl::opt<bool>
                 cl::desc("Display the SLP trees with Graphviz"));
 
 static cl::opt<bool> VectorizeNonPowerOf2(
-    "slp-vectorize-non-power-of-2", cl::init(false), cl::Hidden,
+    "slp-vectorize-non-power-of-2", cl::init(true), cl::Hidden,
     cl::desc("Try to vectorize with non-power-of-2 number of elements."));
 
 static cl::opt<bool> ForcePostProcessStoresOperands(
@@ -242,10 +242,14 @@ static cl::opt<bool> NonVectReductions(
         "Use  non-vectorizable instructions as potential reduction roots."));
 
 /// True when \p slp-vectorize-non-power-of-2 is enabled and \p NumElts is a
-/// supported non-power-of-2 width: \p NumElts + 1 must be a power of two
-/// (e.g. 3 or 7 lanes, i.e. almost a full power-of-2 register).
-static bool isAllowedNonPowerOf2VF(unsigned NumElts) {
-  return VectorizeNonPowerOf2 && has_single_bit(NumElts + 1);
+/// supported non-power-of-2 width. The width is supported if \p NumElts is not
+/// a power of two and either it is small (<= 5, e.g. 3 or 5 lanes), or
+/// \p NumElts - 1 is also not a power of two (e.g. 6, 7, 10..15 lanes), or
+/// the elements being vectorized are themselves vectors (REVEC).
+static bool isAllowedNonPowerOf2VF(unsigned NumElts, bool IsVectorElement) {
+  return VectorizeNonPowerOf2 && !has_single_bit(NumElts) &&
+         ((SLPReVec && IsVectorElement) || NumElts <= 5 ||
+          !has_single_bit(NumElts - 1));
 }
 
 /// Enables vectorization of copyable elements.
@@ -9984,8 +9988,13 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     SmallVector<std::pair<ArrayRef<Value *>, LoadsState>> Results;
     unsigned StartIdx = 0;
     SmallVector<int> CandidateVFs;
-    if (isAllowedNonPowerOf2VF(MaxVF))
-      CandidateVFs.push_back(MaxVF);
+    if (isAllowedNonPowerOf2VF(
+            MaxVF, isa<FixedVectorType>(Loads.front()->getType()))) {
+      const unsigned FullVectorNumElements = getFullVectorNumberOfElements(
+          *TTI, Loads.front()->getType(), MaxVF - 1);
+      if (MaxVF >= 3 && FullVectorNumElements != MaxVF - 1)
+        CandidateVFs.push_back(MaxVF);
+    }
     for (int NumElts = getFloorFullVectorNumberOfElements(
              *TTI, Loads.front()->getType(), MaxVF);
          NumElts > 1; NumElts = getFloorFullVectorNumberOfElements(
@@ -26711,9 +26720,11 @@ SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
       Analysis.buildInstructionsState(ValOps.getArrayRef(), R);
   if (all_of(ValOps, IsaPred<Instruction>) && ValOps.size() > 1) {
     DenseSet<Value *> Stores(Chain.begin(), Chain.end());
-    bool IsAllowedSize = hasFullVectorsOrPowerOf2(
-                             *TTI, ValOps.front()->getType(), ValOps.size()) ||
-                         isAllowedNonPowerOf2VF(ValOps.size());
+    bool IsAllowedSize =
+        hasFullVectorsOrPowerOf2(*TTI, ValOps.front()->getType(),
+                                 ValOps.size()) ||
+        isAllowedNonPowerOf2VF(ValOps.size(),
+                               isa<FixedVectorType>(ValOps.front()->getType()));
     if ((!IsAllowedSize && S && S.getOpcode() != Instruction::Load &&
          (!S.getMainOp()->isSafeToRemove() ||
           any_of(ValOps.getArrayRef(),
@@ -26996,7 +27007,7 @@ bool StoreChainContext::initializeContext(
   // First try a supported non-power-of-2 VF (see isAllowedNonPowerOf2VF).
   unsigned NonPowerOf2VF = 0;
   unsigned CandVF = std::clamp<unsigned>(Operands.size(), MinVF, MaxVF);
-  if (isAllowedNonPowerOf2VF(CandVF)) {
+  if (isAllowedNonPowerOf2VF(CandVF, isa<FixedVectorType>(StoreTy))) {
     NonPowerOf2VF = CandVF;
     assert(NonPowerOf2VF != MaxVF &&
            "Non-power-of-2 VF should not be equal to MaxVF");
@@ -27013,7 +27024,7 @@ bool StoreChainContext::initializeContext(
   }
 
   for (unsigned VF = std::max(MaxVF, NonPowerOf2VF); VF >= MinVF;
-       VF = divideCeil(VF, 2))
+       VF = getFullVectorNumberOfElements(TTI, StoreTy, divideCeil(VF, 2)))
     CandidateVFs.push(VF);
 
   End = Operands.size();
@@ -28520,6 +28531,8 @@ class HorizontalReduction {
     // Try merge consecutive reduced values into a single vectorizable group and
     // check, if they can be vectorized as copyables.
     const bool TwoGroupsOnly = ReducedVals.size() == 2;
+    const bool LastOfTwoGroupsIsSingle =
+        TwoGroupsOnly && ReducedVals.back().size() == 1;
     const bool TwoGroupsOfSameSmallSize =
         TwoGroupsOnly &&
         ReducedVals.front().size() == ReducedVals.back().size() &&
@@ -28739,8 +28752,46 @@ class HorizontalReduction {
           ReduxWidth = bit_floor(ReduxWidth);
         return ReduxWidth;
       };
-      if (!isAllowedNonPowerOf2VF(ReduxWidth))
-        ReduxWidth = GetVectorFactor(ReduxWidth);
+      const unsigned FullRegReduxWidth = GetVectorFactor(ReduxWidth);
+      bool AllowNoPowerOf2 = false;
+      if (isAllowedNonPowerOf2VF(
+              ReduxWidth,
+              isa<FixedVectorType>(Candidates.front()->getType()))) {
+        // For a 5-wide reduction merged from two groups (4 elements plus a
+        // single trailing value) via copyable analysis, refuse the non-power
+        // of-2 width when the lone trailing value does not fit the main-op
+        // operand pattern. Such a mismatch makes a 5-wide vector wasteful
+        // compared to a 4-wide + scalar tail.
+        const unsigned SmallReductionNonPowerOf2 = 5;
+        auto LoneValueMismatchesMainOpOperands = [&]() {
+          Value *LastVal = ReducedVals.back().back();
+          if (!isa<Instruction>(LastVal))
+            return any_of(S.getMainOp()->operand_values(),
+                          IsaPred<Instruction>);
+          unsigned LastOpcode = cast<Instruction>(LastVal)->getOpcode();
+          return none_of(S.getMainOp()->operand_values(), [&](Value *Op) {
+            auto *I = dyn_cast<Instruction>(Op);
+            return I && I->getOpcode() == LastOpcode;
+          });
+        };
+        if (ReduxWidth == ReductionLimit) {
+          AllowNoPowerOf2 = true;
+        } else if (ReduxWidth == SmallReductionNonPowerOf2 && TwoGroupsOnly &&
+                   LastOfTwoGroupsIsSingle && S &&
+                   S.areInstructionsWithCopyableElements() &&
+                   LoneValueMismatchesMainOpOperands()) {
+          AllowNoPowerOf2 = false;
+        } else if (S && !S.isAltShuffle()) {
+          AllowNoPowerOf2 = true;
+        } else {
+          InstructionsState OpS =
+              getSameOpcode(ArrayRef(Candidates).slice(FullRegReduxWidth), TLI);
+          if (!OpS || OpS.isAltShuffle())
+            AllowNoPowerOf2 = true;
+        }
+      }
+      if (!AllowNoPowerOf2)
+        ReduxWidth = FullRegReduxWidth;
       ReduxWidth = std::min(ReduxWidth, MaxElts);
 
       unsigned Start = 0;
diff --git a/llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll b/llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll
index fe0aaf9d80195..8945e32d42715 100644
--- a/llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll
+++ b/llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll
@@ -9,105 +9,38 @@ target triple = "aarch64"
 define dso_local noundef nofpclass(nan inf) float @_Z4testPKfS0_ii(ptr noundef %0, ptr noundef %1, i32 noundef %2, i32 noundef %3) {
 ; CHECK-LABEL: define dso_local noundef nofpclass(nan inf) float @_Z4testPKfS0_ii
 ; CHECK-SAME: (ptr noundef readonly captures(none) [[TMP0:%.*]], ptr noundef readonly captures(none) [[TMP1:%.*]], i32 noundef [[TMP2:%.*]], i32 noundef [[TMP3:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
-; CHECK-NEXT:  .preheader.i:
-; CHECK-NEXT:    [[TMP4:%.*]] = sext i32 [[TMP3]] to i64
 ; CHECK-NEXT:    [[TMP5:%.*]] = sext i32 [[TMP2]] to i64
-; CHECK-NEXT:    [[TMP6:%.*]] = load <20 x float>, ptr [[TMP0]], align 4, !tbaa [[TBAA4:![0-9]+]]
-; CHECK-NEXT:    [[TMP7:%.*]] = load <20 x float>, ptr [[TMP1]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP8:%.*]] = fsub fast <20 x float> [[TMP6]], [[TMP7]]
-; CHECK-NEXT:    [[TMP9:%.*]] = fmul fast <20 x float> [[TMP8]], [[TMP8]]
-; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP0]], i64 80
-; CHECK-NEXT:    [[TMP11:%.*]] = load float, ptr [[TMP10]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP1]], i64 80
-; CHECK-NEXT:    [[TMP13:%.*]] = load float, ptr [[TMP12]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP14:%.*]] = fsub fast float [[TMP11]], [[TMP13]]
-; CHECK-NEXT:    [[TMP15:%.*]] = fmul fast float [[TMP14]], [[TMP14]]
-; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP0]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP1]], i64 [[TMP4]]
-; CHECK-NEXT:    [[OP_RDX:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP15]], <20 x float> [[TMP9]])
-; CHECK-NEXT:    [[TMP18:%.*]] = load <20 x float>, ptr [[TMP16]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP19:%.*]] = load <20 x float>, ptr [[TMP17]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP20:%.*]] = fsub fast <20 x float> [[TMP18]], [[TMP19]]
-; CHECK-NEXT:    [[TMP21:%.*]] = fmul fast <20 x float> [[TMP20]], [[TMP20]]
-; CHECK-NEXT:    [[TMP22:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP16]], i64 80
-; CHECK-NEXT:    [[TMP23:%.*]] = load float, ptr [[TMP22]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP17]], i64 80
-; CHECK-NEXT:    [[TMP25:%.*]] = load float, ptr [[TMP24]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP26:%.*]] = fsub fast float [[TMP23]], [[TMP25]]
-; CHECK-NEXT:    [[TMP27:%.*]] = fmul fast float [[TMP26]], [[TMP26]]
-; CHECK-NEXT:    [[TMP28:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP16]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP17]], i64 [[TMP4]]
-; CHECK-NEXT:    [[OP_RDX_1:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP27]], <20 x float> [[TMP21]])
-; CHECK-NEXT:    [[OP_RDX3_1:%.*]] = fadd fast float [[OP_RDX_1]], [[OP_RDX]]
-; CHECK-NEXT:    [[TMP30:%.*]] = load <20 x float>, ptr [[TMP28]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP31:%.*]] = load <20 x float>, ptr [[TMP29]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP32:%.*]] = fsub fast <20 x float> [[TMP30]], [[TMP31]]
-; CHECK-NEXT:    [[TMP33:%.*]] = fmul fast <20 x float> [[TMP32]], [[TMP32]]
-; CHECK-NEXT:    [[TMP34:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP28]], i64 80
-; CHECK-NEXT:    [[TMP35:%.*]] = load float, ptr [[TMP34]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP36:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP29]], i64 80
-; CHECK-NEXT:    [[TMP37:%.*]] = load float, ptr [[TMP36]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP38:%.*]] = fsub fast float [[TMP35]], [[TMP37]]
-; CHECK-NEXT:    [[TMP39:%.*]] = fmul fast float [[TMP38]], [[TMP38]]
-; CHECK-NEXT:    [[TMP40:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP28]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP41:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP29]], i64 [[TMP4]]
-; CHECK-NEXT:    [[OP_RDX_2:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP39]], <20 x float> [[TMP33]])
-; CHECK-NEXT:    [[OP_RDX3_2:%.*]] = fadd fast float [[OP_RDX_2]], [[OP_RDX3_1]]
-; CHECK-NEXT:    [[TMP42:%.*]] = load <20 x float>, ptr [[TMP40]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP43:%.*]] = load <20 x float>, ptr [[TMP41]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP44:%.*]] = fsub fast <20 x float> [[TMP42]], [[TMP43]]
-; CHECK-NEXT:    [[TMP45:%.*]] = fmul fast <20 x float> [[TMP44]], [[TMP44]]
-; CHECK-NEXT:    [[TMP46:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP40]], i64 80
-; CHECK-NEXT:    [[TMP47:%.*]] = load float, ptr [[TMP46]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP48:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP41]], i64 80
-; CHECK-NEXT:    [[TMP49:%.*]] = load float, ptr [[TMP48]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP50:%.*]] = fsub fast float [[TMP47]], [[TMP49]]
-; CHECK-NEXT:    [[TMP51:%.*]] = fmul fast float [[TMP50]], [[TMP50]]
-; CHECK-NEXT:    [[TMP52:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP40]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP53:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP41]], i64 [[TMP4]]
-; CHECK-NEXT:    [[OP_RDX_3:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP51]], <20 x float> [[TMP45]])
-; CHECK-NEXT:    [[OP_RDX3_3:%.*]] = fadd fast float [[OP_RDX_3]], [[OP_RDX3_2]]
-; CHECK-NEXT:    [[TMP54:%.*]] = load <20 x float>, ptr [[TMP52]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP55:%.*]] = load <20 x float>, ptr [[TMP53]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP56:%.*]] = fsub fast <20 x float> [[TMP54]], [[TMP55]]
-; CHECK-NEXT:    [[TMP57:%.*]] = fmul fast <20 x float> [[TMP56]], [[TMP56]]
-; CHECK-NEXT:    [[TMP58:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP52]], i64 80
-; CHECK-NEXT:    [[TMP59:%.*]] = load float, ptr [[TMP58]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP60:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP53]], i64 80
-; CHECK-NEXT:    [[TMP61:%.*]] = load float, ptr [[TMP60]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP62:%.*]] = fsub fast float [[TMP59]], [[TMP61]]
-; CHECK-NEXT:    [[TMP63:%.*]] = fmul fast float [[TMP62]], [[TMP62]]
-; CHECK-NEXT:    [[TMP64:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP52]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP65:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP53]], i64 [[TMP4]]
-; CHECK-NEXT:    [[OP_RDX_4:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP63]], <20 x float> [[TMP57]])
-; CHECK-NEXT:    [[OP_RDX3_4:%.*]] = fadd fast float [[OP_RDX_4]], [[OP_RDX3_3]]
-; CHECK-NEXT:    [[TMP66:%.*]] = load <20 x float>, ptr [[TMP64]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP67:%.*]] = load <20 x float>, ptr [[TMP65]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP68:%.*]] = fsub fast <20 x float> [[TMP66]], [[TMP67]]
-; CHECK-NEXT:    [[TMP69:%.*]] = fmul fast <20 x float> [[TMP68]], [[TMP68]]
-; CHECK-NEXT:    [[TMP70:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP64]], i64 80
-; CHECK-NEXT:    [[TMP71:%.*]] = load float, ptr [[TMP70]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP72:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP65]], i64 80
-; CHECK-NEXT:    [[TMP73:%.*]] = load float, ptr [[TMP72]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP74:%.*]] = fsub fast float [[TMP71]], [[TMP73]]
-; CHECK-NEXT:    [[TMP75:%.*]] = fmul fast float [[TMP74]], [[TMP74]]
-; CHECK-NEXT:    [[TMP76:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP64]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP77:%.*]] = getelementptr inbounds [4 x i8], ptr [[TMP65]], i64 [[TMP4]]
-; CHECK-NEXT:    [[OP_RDX_5:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP75]], <20 x float> [[TMP69]])
-; CHECK-NEXT:    [[OP_RDX3_5:%.*]] = fadd fast float [[OP_RDX_5]], [[OP_RDX3_4]]
-; CHECK-NEXT:    [[TMP78:%.*]] = load <20 x float>, ptr [[TMP76]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP79:%.*]] = load <20 x float>, ptr [[TMP77]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP80:%.*]] = fsub fast <20 x float> [[TMP78]], [[TMP79]]
-; CHECK-NEXT:    [[TMP81:%.*]] = fmul fast <20 x float> [[TMP80]], [[TMP80]]
-; CHECK-NEXT:    [[TMP82:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP76]], i64 80
-; CHECK-NEXT:    [[TMP83:%.*]] = load float, ptr [[TMP82]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP84:%.*]] = getelementptr inbounds nuw i8, ptr [[TMP77]], i64 80
-; CHECK-NEXT:    [[TMP85:%.*]] = load float, ptr [[TMP84]], align 4, !tbaa [[TBAA4]]
-; CHECK-NEXT:    [[TMP86:%.*]] = fsub fast float [[TMP83]], [[TMP85]]
-; CHECK-NEXT:    [[TMP87:%.*]] = fmul fast float [[TMP86]], [[TMP86]]
-; CHECK-NEXT:    [[OP_RDX_6:%.*]] = tail call fast float @llvm.vector.reduce.fadd.v20f32(float [[TMP87]], <20 x float> [[TMP81]])
-; CHECK-NEXT:    [[OP_RDX3_6:%.*]] = fadd fast float [[OP_RDX_6]], [[OP_RDX3_5]]
-; CHECK-NEXT:    ret float [[OP_RDX3_6]]
+; CHECK-NEXT:    [[TMP6:%.*]] = sext i32 [[TMP3]] to i64
+; CHECK-NEXT:    br label [[DOTPREHEADER_I:%.*]]
+; CHECK:       .preheader.i:
+; CHECK-NEXT:    [[DOT027_I:%.*]] = phi ptr [ [[TMP0]], [[TMP4:%.*]] ], [ [[TMP23:%.*]], [[DOTPREHEADER_I]] ]
+; CHECK-NEXT:    [[DOT01926_I:%.*]] = phi i32 [ 0, [[TMP4]] ], [ [[TMP26:%.*]], [[DOTPREHEADER_I]] ]
+; CHECK-NEXT:    [[DOT02025_I:%.*]] = phi float [ 0.000000e+00, [[TMP4]] ], [ [[TMP25:%.*]], [[DOTPREHEADER_I]] ]
+; CHECK-NEXT:    [[DOT02124_I:%.*]] = phi ptr [ [[TMP1]], [[TMP4]] ], [ [[TMP24:%.*]], [[DOTPREHEADER_I]] ]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds nuw i8, ptr [[DOT027_I]], i64 80
+; CHECK-NEXT:    [[TMP8:%.*]] = load float, ptr [[TMP7]], align 4, !tbaa [[TBAA4:![0-9]+]]
+; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds nuw i8, ptr [[DOT02124_I]], i64 80
+; CHECK-NEXT:    [[TMP10:%.*]] = load float, ptr [[TMP9]], align 4, !tbaa [[TBAA4]]
+; CHECK-NEXT:    [[TMP11:%.*]] = load <20 x float>, ptr [[DOT027_I]], align 4, !tbaa [[TBAA4]]
+; CHECK-NEXT:    [[TMP12:%.*]] = load <20 x float>, ptr [[DOT02124_I]], align 4, !tbaa [[TBAA4]]
+; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <22 x float> poison, float [[TMP8]], i64 20
+; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <22 x float> [[TMP13]], float [[DOT02025_I]], i64 21
+; CHECK-NEXT:    [[TMP15:%.*]] = shufflevector <20 x float> [[TMP11]], <20 x float> poison, <22 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP16:%.*]] = shufflevector <22 x float> [[TMP15]], <22 x float> [[TMP14]], <22 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 42, i32 43>
+; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <22 x float> <float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float poison, float 0.000000e+00>, float [[TMP10]], i64 20
+; CHECK-NEXT:    [[TMP18:%.*]] = shufflevector <20 x float> [[TMP12]], <20 x float> poison, <22 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP19:%.*]] = shufflevector <22 x float> [[TMP18]], <22 x float> [[TMP17]], <22 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 42, i32 43>
+; CHECK-NEXT:    [[TMP20:%.*]] = fsub <22 x float> [[TMP16]], [[TMP19]]
+; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <22 x float> [[TMP20]], float 1.000000e+00, i64 21
+; CHECK-NEXT:    [[TMP22:%.*]] = fmul <22 x float> [[TMP20]], [[TMP21]]
+; CHECK-NEXT:    [[TMP23]] = getelementptr inbounds [4 x i8], ptr [[DOT027_I]], i64 [[TMP5]]
+; CHECK-NEXT:    [[TMP24]] = getelementptr inbounds [4 x i8], ptr [[DOT02124_I]], i64 [[TMP6]]
+; CHECK-NEXT:    [[TMP25]] = tail call fast float @llvm.vector.reduce.fadd.v22f32(float 0.000000e+00, <22 x float> [[TMP22]])
+; CHECK-NEXT:    [[TMP26]] = add nuw nsw i32 [[DOT01926_I]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT_I:%.*]] = icmp eq i32 [[TMP26]], 7
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT_I]], label [[_ZL6REDUCEILI7EEFPKFS1_II_EXIT:%.*]], label [[DOTPREHEADER_I]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK:       _ZL6reduceILi7EEfPKfS1_ii.exit:
+; CHECK-NEXT:    ret float [[TMP25]]
 ;
   %5 = alloca ptr, align 8
   %6 = alloca ptr, align 8
diff --git a/llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll b/llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll
index 2b38cfe7f21bd..c3464a21466de 100644
--- a/llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll
+++ b/llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll
@@ ...
[truncated]

@github-actions

github-actions Bot commented May 10, 2026

Copy link
Copy Markdown

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef([^a-zA-Z0-9_-]|$)|UndefValue::get)' 'HEAD~1' HEAD llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp llvm/test/CodeGen/WebAssembly/slp-memory-interleave.ll llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll llvm/test/Transforms/SLPVectorizer/AArch64/scalable-vector.ll llvm/test/Transforms/SLPVectorizer/AArch64/trunc-insertion.ll llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s352.ll llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll llvm/test/Transforms/SLPVectorizer/RISCV/horizontal-list.ll llvm/test/Transforms/SLPVectorizer/RISCV/reduced-value-repeated-and-vectorized.ll llvm/test/Transforms/SLPVectorizer/RISCV/revec.ll llvm/test/Transforms/SLPVectorizer/SystemZ/minbitwidth-trunc.ll llvm/test/Transforms/SLPVectorizer/X86/bool-mask.ll llvm/test/Transforms/SLPVectorizer/X86/buildvector-store-chains.ll llvm/test/Transforms/SLPVectorizer/X86/compare-reduce.ll llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll llvm/test/Transforms/SLPVectorizer/X86/entries-shuffled-diff-sizes.ll llvm/test/Transforms/SLPVectorizer/X86/extracts-non-extendable.ll llvm/test/Transforms/SLPVectorizer/X86/gather-extractelements-different-bbs.ll llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll llvm/test/Transforms/SLPVectorizer/X86/multi-use-bitcasted-reduction.ll llvm/test/Transforms/SLPVectorizer/X86/multi_user.ll llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll llvm/test/Transforms/SLPVectorizer/X86/parent-node-split-non-schedulable.ll llvm/test/Transforms/SLPVectorizer/X86/reduced-val-vectorized-in-transform.ll llvm/test/Transforms/SLPVectorizer/X86/reduction-bool-logic-op-inside.ll llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll llvm/test/Transforms/SLPVectorizer/X86/reduction-same-vals.ll llvm/test/Transforms/SLPVectorizer/X86/rgb_phi.ll llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll llvm/test/Transforms/SLPVectorizer/X86/select-reduction-op.ll llvm/test/Transforms/SLPVectorizer/X86/vec3-base.ll llvm/test/Transforms/SLPVectorizer/bool-logical-op-reduction-with-poison.ll llvm/test/Transforms/SLPVectorizer/logical-ops-poisonous-repeated.ll llvm/test/Transforms/SLPVectorizer/partial-register-extract.ll llvm/test/Transforms/SLPVectorizer/reduced-gathered-vectorized.ll

The following files introduce new uses of undef:

  • llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@github-actions

github-actions Bot commented May 10, 2026

Copy link
Copy Markdown

🐧 Linux x64 Test Results

  • 195476 tests passed
  • 5216 tests skipped

✅ The build succeeded and all tests passed.

@github-actions

github-actions Bot commented May 10, 2026

Copy link
Copy Markdown

🪟 Windows x64 Test Results

  • 134827 tests passed
  • 3285 tests skipped

✅ The build succeeded and all tests passed.

@nikic

nikic commented May 10, 2026

Copy link
Copy Markdown
Contributor

Created using spr 1.3.7
} else {
VectorType *VecTy = ::getWidenedType(StoreTy, NonPowerOf2VF);
if (!SingleContext && CandVF == 3 &&
TTI.getMemoryOpCost(Instruction::Store, VecTy, Store->getAlign(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making note, this might need adjusted later for strided stores. Good for now.

Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
Comment on lines +27465 to +27466
if (CtxPtr->initializeContext(R, *DL, *TTI, Visited,
AllContexts.size() == 1))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was going to merge #193616 once its compile time test finished, but I will hold off on that merge until after this because I suspect that change will break this metric.

Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated
Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

@bababuck bababuck left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (though seems to be code formatting issue)

Edit: Added node about code formatting

@alexey-bataev

Copy link
Copy Markdown
Member Author

LGTM (though seems to be code formatting issue)

Edit: Added node about code formatting

It is not formatting, it is undef used in the test

@bababuck

Copy link
Copy Markdown
Contributor

LGTM (though seems to be code formatting issue)
Edit: Added node about code formatting

It is not formatting, it is undef used in the test

Ugh it's annoying that that gets flagged :/

@RKSimon RKSimon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clean up these magic numbers (3 and 5) - I'm assuming to stop random regressions?

Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated
Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated
Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated
Comment thread llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp Outdated
@alexey-bataev

Copy link
Copy Markdown
Member Author

Can you clean up these magic numbers (3 and 5) - I'm assuming to stop random regressions?

Mostly compile time tuning

Created using spr 1.3.7
@RKSimon RKSimon self-requested a review May 13, 2026 11:44

@RKSimon RKSimon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but I'm expecting regression reports due to cost tables not handling some of these exotic types properly

@alexey-bataev

Copy link
Copy Markdown
Member Author

LGTM - but I'm expecting regression reports due to cost tables not handling some of these exotic types properly

I expect that types are legalized to correct power-of-2 types, so the table should handle the costs correctly

Created using spr 1.3.7
@alexey-bataev alexey-bataev merged commit aa2f124 into main May 17, 2026
9 of 10 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slp-enable-full-non-power-of-2-vectorization-by-default branch May 17, 2026 11:10
llvm-sync Bot pushed a commit to arm/arm-toolchain that referenced this pull request May 17, 2026
Default slp-vectorize-non-power-of-2 to true and broaden the set of
supported widths beyond NumElts + 1 == bit_ceil(NumElts) to include
small widths (<= 5), widths where NumElts - 1 is also non-power of two
(e.g. 6, 7, 10..15), and any width when the elements being vectorized
are themselves vectors (REVEC). Tweak gathered loads, stores, and
reduction support to the non-power-of-2 vector factors.

Reviewers: hiraditya, bababuck, RKSimon

Pull Request: llvm/llvm-project#196825
llvm-upstreamsync Bot pushed a commit to qualcomm/cpullvm-toolchain that referenced this pull request May 17, 2026
Default slp-vectorize-non-power-of-2 to true and broaden the set of
supported widths beyond NumElts + 1 == bit_ceil(NumElts) to include
small widths (<= 5), widths where NumElts - 1 is also non-power of two
(e.g. 6, 7, 10..15), and any width when the elements being vectorized
are themselves vectors (REVEC). Tweak gathered loads, stores, and
reduction support to the non-power-of-2 vector factors.

Reviewers: hiraditya, bababuck, RKSimon

Pull Request: llvm/llvm-project#196825
cpullvm-upstream-sync Bot pushed a commit to navaneethshan/cpullvm-toolchain-1 that referenced this pull request May 17, 2026
Default slp-vectorize-non-power-of-2 to true and broaden the set of
supported widths beyond NumElts + 1 == bit_ceil(NumElts) to include
small widths (<= 5), widths where NumElts - 1 is also non-power of two
(e.g. 6, 7, 10..15), and any width when the elements being vectorized
are themselves vectors (REVEC). Tweak gathered loads, stores, and
reduction support to the non-power-of-2 vector factors.

Reviewers: hiraditya, bababuck, RKSimon

Pull Request: llvm/llvm-project#196825
@davemgreen

Copy link
Copy Markdown
Contributor

Hi - I see this has been reverted but are you sure this is profitable as-is? We seemed to see a lot of issues from it - I didn't think that DAG handled non-power 2 cross block vectors very well. That has blocked me in the past from making the cost model better (more accurate, lower cost non-power2 vectors), as more SLP vectorization occurred that led to me scalarization.

@asb

asb commented May 18, 2026

Copy link
Copy Markdown
Contributor

FWIW this PR had been causing failures on the bootstrapping RVV bots since landing (at the same time, the buildbot web UI seems to have been mostly non-functional for me this morning so it took a little longer to dig in). Now it's been reverted as part of #198265 hopefully the bots will go green again.

I have this minimal reproducer:

target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"

define void @foo() #0 {
entry:
  %p = inttoptr i64 0 to ptr
  br label %loop

loop:
  %v0 = load i64, ptr %p, align 8
  %p16 = getelementptr i8, ptr %p, i64 16
  store i64 %v0, ptr %p16, align 8
  %p8 = getelementptr i8, ptr %p, i64 8
  %v1 = load i64, ptr %p8, align 8
  %p24 = getelementptr i8, ptr %p, i64 24
  %v2 = load i64, ptr %p24, align 8
  store i64 %v2, ptr %p8, align 8
  store i64 %v1, ptr %p, align 8
  ret void
}

attributes #0 = { "target-features"="+v" }

Which gives this error:

$ ./stage1/bin/opt -passes=slp-vectorizer -S -o - tc.ll
opt: /home/asb/slp-repro/llvm/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1520: bool (anonymous namespace)::InstructionsState::isExpandedBinOp(Value *) const: Assertion `MainOp && "The instruction should be compatible with either main or alt op."' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace and instructions to reproduce the bug.
Stack dump:
0.	Program arguments: ./stage1/bin/opt -passes=slp-vectorizer -S -o - tc.ll
1.	Running pass "function(slp-vectorizer)" on module "tc.ll"
2.	Running pass "slp-vectorizer" on function "foo"
 #0 0x000056404e438076 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (./stage1/bin/opt+0x23b1076)
 #1 0x000056404e434c01 llvm::sys::RunSignalHandlers() (./stage1/bin/opt+0x23adc01)
 #2 0x000056404e4391e4 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007f790804d2d0 (/usr/lib/libc.so.6+0x3e2d0)
 #4 0x00007f79080a7a2c (/usr/lib/libc.so.6+0x98a2c)
 #5 0x00007f790804d1a0 raise (/usr/lib/libc.so.6+0x3e1a0)
 #6 0x00007f79080345fe abort (/usr/lib/libc.so.6+0x255fe)
 #7 0x00007f7908035697 (/usr/lib/libc.so.6+0x26697)
 #8 0x00007f79080450b2 __assert_fail (/usr/lib/libc.so.6+0x360b2)
 #9 0x000056405048a616 (anonymous namespace)::InstructionsState::isExpandedBinOp(llvm::Value*) const SLPVectorizer.cpp:0:0
#10 0x00005640504c2c2e llvm::slpvectorizer::BoUpSLP::BlockScheduling::tryScheduleBundle(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP*, (anonymous namespace)::InstructionsState const&, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&) SLPVectorizer.cpp:0:0
#11 0x00005640504aa616 llvm::slpvectorizer::BoUpSLP::buildTreeRec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&, unsigned int) (./stage1/bin/opt+0x4423616)
#12 0x00005640504ab098 llvm::slpvectorizer::BoUpSLP::buildTreeRec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&, unsigned int) (./stage1/bin/opt+0x4424098)
#13 0x00005640504ab098 llvm::slpvectorizer::BoUpSLP::buildTreeRec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&, unsigned int) (./stage1/bin/opt+0x4424098)
#14 0x00005640504acdab llvm::slpvectorizer::BoUpSLP::buildTree(llvm::ArrayRef<llvm::Value*>) (./stage1/bin/opt+0x4425dab)
#15 0x000056405054b7c7 llvm::SLPVectorizerPass::vectorizeStoreChain(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, unsigned int, unsigned int, unsigned int&) (./stage1/bin/opt+0x44c47c7)
#16 0x000056405054f991 llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&, llvm::DenseSet<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, llvm::DenseMapInfo<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, void>>&)::$_0::operator()(std::map<long, unsigned int, std::less<long>, std::allocator<std::pair<long const, unsigned int>>> const&) const SLPVectorizer.cpp:0:0
#17 0x000056405054d4ab llvm::SLPVectorizerPass::vectorizeStores(llvm::ArrayRef<llvm::StoreInst*>, llvm::slpvectorizer::BoUpSLP&, llvm::DenseSet<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, llvm::DenseMapInfo<std::tuple<llvm::Value*, llvm::Value*, llvm::Value*, llvm::Value*, unsigned int>, void>>&) (./stage1/bin/opt+0x44c64ab)
#18 0x000056405054705a llvm::SLPVectorizerPass::vectorizeStoreChains(llvm::slpvectorizer::BoUpSLP&) (./stage1/bin/opt+0x44c005a)
#19 0x000056405054540e llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (./stage1/bin/opt+0x44be40e)
#20 0x0000564050544911 llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (./stage1/bin/opt+0x44bd911)
#21 0x000056404e68b295 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (./stage1/bin/opt+0x2604295)
#22 0x000056404e68fd89 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (./stage1/bin/opt+0x2608d89)
#23 0x000056404e689f95 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (./stage1/bin/opt+0x2602f95)
#24 0x000056404ff26eae llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::PassPlugin>, llvm::ArrayRef<std::function<void (llvm::PassBuilder&)>>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool, bool, bool, bool) (./stage1/bin/opt+0x3e9feae)
#25 0x000056404e3d3add optMain (./stage1/bin/opt+0x234cadd)
#26 0x00007f79080366c1 (/usr/lib/libc.so.6+0x276c1)
#27 0x00007f79080367f9 __libc_start_main (/usr/lib/libc.so.6+0x277f9)
#28 0x000056404e3cd0a5 _start (./stage1/bin/opt+0x23460a5)
Aborted                    (core dumped) ./stage1/bin/opt -passes=slp-vectorizer -S -o - tc.ll

But works fine if you set -slp-vectorize-non-power-of-2=false.

@alexey-bataev

Copy link
Copy Markdown
Member Author

Hi - I see this has been reverted but are you sure this is profitable as-is? We seemed to see a lot of issues from it - I didn't think that DAG handled non-power 2 cross block vectors very well. That has blocked me in the past from making the cost model better (more accurate, lower cost non-power2 vectors), as more SLP vectorization occurred that led to me scalarization.

The main part is enabled already, this one just enables it for stores/reductions. I did not see significant perf regressions, hope for stores/reductions it should be good enough

@davemgreen

Copy link
Copy Markdown
Contributor

I dont have the example I had back then any more - it was with changes to the cost model to produce better costs for non-power-2 vector sizes like v20f32. (They are 5 * v4f32, not 8 * v4f32). And with the existing SLP vectorizer non-power-2 handling.

The problem I was seeing was something like https://godbolt.org/z/Yj43PeMoa, where non=power2 vectors get scalarized across BB boundaries as they follow the rather silly calling convention. That scalarization didn't seem to be accounted for anywhere in SLP, although it might be better to fix the codegen if we can. GISel doesnt have the same issues.

@alexey-bataev

Copy link
Copy Markdown
Member Author

I dont have the example I had back then any more - it was with changes to the cost model to produce better costs for non-power-2 vector sizes like v20f32. (They are 5 * v4f32, not 8 * v4f32). And with the existing SLP vectorizer non-power-2 handling.

The problem I was seeing was something like https://godbolt.org/z/Yj43PeMoa, where non=power2 vectors get scalarized across BB boundaries as they follow the rather silly calling convention. That scalarization didn't seem to be accounted for anywhere in SLP, although it might be better to fix the codegen if we can. GISel doesnt have the same issues.

That should definitely be fixed in codegen or accounted in the cost model. The conservative solution might be to increase the cost of non-power-of-2 operations, to make SLP prefer power-of-2 more

pedroMVicente pushed a commit to pedroMVicente/llvm-project that referenced this pull request May 19, 2026
Default slp-vectorize-non-power-of-2 to true and broaden the set of
supported widths beyond NumElts + 1 == bit_ceil(NumElts) to include
small widths (<= 5), widths where NumElts - 1 is also non-power of two
(e.g. 6, 7, 10..15), and any width when the elements being vectorized
are themselves vectors (REVEC). Tweak gathered loads, stores, and
reduction support to the non-power-of-2 vector factors.

Reviewers: hiraditya, bababuck, RKSimon

Pull Request: llvm#196825
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants