
Revert "[SLP] Vectorize struct-returning intrinsics" #198265

Merged
zmodem merged 6 commits into llvm:main from zmodem:revert_slp on May 18, 2026


zmodem commented May 18, 2026


The reverted change causes assertion failures such as the following one. See the discussion on the PR.

Constants.cpp:2802:
static Constant *llvm::ConstantExpr::getInsertElement(Constant *, Constant *, Constant *, Type *): Assertion `Val->getType()->isVectorTy() &&
"Tried to create insertelement operation on non-vector type!"' failed.

> Allow SLP to combine across lanes calls that return a literal struct
> (llvm.sincos, llvm.*.with.overflow, llvm.frexp, ...) into a single
> call returning a struct of vectors, by widening {T, T, ...} to
> {<VF x T>, ...} via VectorTypeUtils and emitting extractvalue +
> extractelement for external uses.
>
> Original Pull Request: #195521
>
> Original Pull Request2: #196756
>
> Recommit after revert #197969
>
> Added check for valid vectorizable type.
>
> Reviewers:
>
> Pull Request: #197994

This reverts commit 1c5e395
and the follow-up or dependent commits that have landed since:

aa2f124 [SLP] Enable full non-power-of-2 vectorization by default
6e8b6ef [SLP][REVEC] Fix crash on scalable vector types with -slp-revec
8156fce [SLP] Prefer VF-matching scalar-set match in gather-shuffle lookup
97ce93a [SLP]Consider non-profitable trees with buildvector of struct-returning instructions
f0adfab [SLP] Preserve profitable trees when subtree trimming would reduce to buildvector-only
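As a rough illustration of what the reverted feature did (a toy model, not the SLPVectorizer code): four scalar `llvm.sincos` calls, each returning `{float, float}`, would become one call returning `{<4 x float>, <4 x float>}`, and an external user of one original call reads its field with `extractvalue` and then its lane with `extractelement`. `ToyType`, `widenToyType`, `SinCosX4` and `extractLaneField` are hypothetical names introduced here; the actual change operated on `llvm::Type` using the helpers in `llvm/IR/VectorTypeUtils.h`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of an IR result type (hypothetical, not LLVM's Type API):
// a scalar such as "float", or a literal struct of scalars such as
// {float, float} for llvm.sincos or {i32, i1} for llvm.sadd.with.overflow.
struct ToyType {
  std::vector<std::string> Fields; // a single entry for a plain scalar
  bool IsStruct = false;
};

// Widen T to <VF x T>, and {T0, T1, ...} to {<VF x T0>, <VF x T1>, ...}.
std::string widenToyType(const ToyType &Ty, unsigned VF) {
  auto Vec = [VF](const std::string &Elt) {
    return "<" + std::to_string(VF) + " x " + Elt + ">";
  };
  if (!Ty.IsStruct)
    return Vec(Ty.Fields.front());
  std::string Out = "{ ";
  for (std::size_t I = 0; I < Ty.Fields.size(); ++I) {
    if (I != 0)
      Out += ", ";
    Out += Vec(Ty.Fields[I]);
  }
  return Out + " }";
}

// A widened {<4 x float>, <4 x float>} sincos result, as a struct of arrays.
struct SinCosX4 {
  std::array<float, 4> Sin; // struct field 0
  std::array<float, 4> Cos; // struct field 1
};

// What an external user of one original scalar call needs: pick the struct
// field (the extractvalue step), then pick the lane (the extractelement step).
float extractLaneField(const SinCosX4 &R, unsigned Field, unsigned Lane) {
  assert(Field < 2 && Lane < 4 && "field or lane out of range");
  const std::array<float, 4> &FieldVec = (Field == 0) ? R.Sin : R.Cos;
  return FieldVec[Lane];
}
```

The struct-of-vectors layout keeps each field in its own vector register, which is why external uses need the two-step extraction rather than a single `extractelement`.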

llvmorg-github-actions bot commented May 18, 2026


@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-backend-amdgpu

Author: Hans Wennborg (zmodem)

Changes



Patch is 884.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/198265.diff

61 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+204-604)
  • (modified) llvm/test/CodeGen/WebAssembly/slp-memory-interleave.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll (+98-31)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll (+5-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll (+5-2)
  • (removed) llvm/test/Transforms/SLPVectorizer/AArch64/scalable-type-revec.ll (-22)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/scalable-vector.ll (+1-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/trunc-insertion.ll (+11-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s352.ll (+11-18)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+21-6)
  • (removed) llvm/test/Transforms/SLPVectorizer/AMDGPU/transform-node-gather-struct.ll (-49)
  • (removed) llvm/test/Transforms/SLPVectorizer/RISCV/complex-nonvect-struct-returned.ll (-22)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/horizontal-list.ll (+41-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reduced-value-repeated-and-vectorized.ll (+7-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/revec.ll (+60-24)
  • (modified) llvm/test/Transforms/SLPVectorizer/SystemZ/minbitwidth-trunc.ll (+9-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll (+10-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll (+10-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-add-saddo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-add-uaddo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-fp-inseltpoison.ll (+32-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-fp.ll (+32-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-mul-smulo.ll (+615-549)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-mul-umulo.ll (+615-429)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-sub-ssubo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-sub-usubo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/bool-mask.ll (+11-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/buildvector-store-chains.ll (+7-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/compare-reduce.ll (+6-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/deleted-instructions-clear.ll (+10-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll (+42-137)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/entries-shuffled-diff-sizes.ll (+7-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/extracts-non-extendable.ll (+2-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/gather-extractelements-different-bbs.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll (+56-44)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll (+76-140)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/multi-use-bitcasted-reduction.ll (+6-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/multi_user.ll (+11-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll (+11-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/parent-node-split-non-schedulable.ll (+7-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduced-val-vectorized-in-transform.ll (+6-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-bool-logic-op-inside.ll (+9-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+23-34)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-same-vals.ll (+9-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reused-extract-scalar-lanes.ll (+3-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/revec-non-power-2-to-power-2-large-vect.ll (+3-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/rgb_phi.ll (+20-14)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scalarize-ctlz.ll (+29-48)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll (+8-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-copyable-cmp-poison.ll (+6-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-reduction-op.ll (+8-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/trunced-buildvector-scalar-extended.ll (+5-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec3-base.ll (+12-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll (+4-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/bool-logical-op-reduction-with-poison.ll (+16-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/logical-ops-poisonous-repeated.ll (+9-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/partial-register-extract.ll (+22-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/reduced-gathered-vectorized.ll (+8-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/sincos.ll (+32-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/struct-return-revec.ll (+16-12)
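The `isAllowedNonPowerOf2VF` hunk in the SLPVectorizer.cpp diff below swaps the width rule back to the older, narrower one, and the `slp-vectorize-non-power-of-2` default goes from `true` back to `false`. A standalone sketch of both predicates, assuming the `cl::opt` globals `VectorizeNonPowerOf2` and `SLPReVec` are passed in as plain parameters; `hasSingleBit` is a local stand-in for `llvm::has_single_bit`, and the function names are hypothetical.

```cpp
// Local stand-in for llvm::has_single_bit (C++20 std::has_single_bit):
// true iff X is a nonzero power of two.
bool hasSingleBit(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

// Restored rule ("+" lines): NumElts + 1 must be a power of two,
// i.e. almost a full power-of-2 register (3, 7, 15, ... lanes).
bool allowedNonPow2Restored(bool VectorizeNonPowerOf2, unsigned NumElts) {
  return VectorizeNonPowerOf2 && hasSingleBit(NumElts + 1);
}

// Reverted rule ("-" lines): any non-power-of-2 width that is small (<= 5),
// or whose predecessor is not a power of two either, or whose elements are
// themselves vectors under REVEC.
bool allowedNonPow2Reverted(bool VectorizeNonPowerOf2, bool SLPReVec,
                            unsigned NumElts, bool IsVectorElement) {
  const unsigned SmallProfitableNonPowerOf2 = 5;
  return VectorizeNonPowerOf2 && !hasSingleBit(NumElts) &&
         ((SLPReVec && IsVectorElement) ||
          NumElts <= SmallProfitableNonPowerOf2 ||
          !hasSingleBit(NumElts - 1));
}
```

For example, with the option enabled and no REVEC, the reverted rule accepts 5, 6 and 7 lanes but rejects 9 (because 8 is a power of two), while the restored rule accepts only widths one below a power of two, such as 3, 7 and 15. At the restored default (`cl::init(false)`), the restored predicate rejects every width.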
diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
index 23a79df7b2cee..8f512f0fc3ee8 100644
--- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
@@ -57,9 +57,9 @@ class BoUpSLP;
 
 struct SLPVectorizerPass : public OptionalPassInfoMixin<SLPVectorizerPass> {
   using StoreList = SmallVector<StoreInst *, 8>;
-  using StoreListMap = SmallMapVector<Value *, StoreList, 8>;
+  using StoreListMap = MapVector<Value *, StoreList>;
   using GEPList = SmallVector<GetElementPtrInst *, 8>;
-  using GEPListMap = SmallMapVector<Value *, GEPList, 8>;
+  using GEPListMap = MapVector<Value *, GEPList>;
   using InstSetVector = SmallSetVector<Instruction *, 8>;
 
   ScalarEvolution *SE = nullptr;
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 898115005a7dd..3ec332b93caa9 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -28,7 +28,6 @@
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallString.h"
-#include "llvm/ADT/SmallVectorExtras.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/ADT/iterator_range.h"
@@ -72,7 +71,6 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
-#include "llvm/IR/VectorTypeUtils.h"
 #ifdef EXPENSIVE_CHECKS
 #include "llvm/IR/Verifier.h"
 #endif
@@ -231,7 +229,7 @@ static cl::opt<bool>
                 cl::desc("Display the SLP trees with Graphviz"));
 
 static cl::opt<bool> VectorizeNonPowerOf2(
-    "slp-vectorize-non-power-of-2", cl::init(true), cl::Hidden,
+    "slp-vectorize-non-power-of-2", cl::init(false), cl::Hidden,
     cl::desc("Try to vectorize with non-power-of-2 number of elements."));
 
 static cl::opt<bool> ForcePostProcessStoresOperands(
@@ -243,19 +241,11 @@ static cl::opt<bool> NonVectReductions(
     cl::desc(
         "Use  non-vectorizable instructions as potential reduction roots."));
 
-static constexpr unsigned SmallProfitableNonPowerOf2 = 5;
-static constexpr unsigned SmallestNonPowerOf2 = 3;
-
 /// True when \p slp-vectorize-non-power-of-2 is enabled and \p NumElts is a
-/// supported non-power-of-2 width. The width is supported if \p NumElts is not
-/// a power of two and either it is small (<= 5, e.g. 3 or 5 lanes), or
-/// \p NumElts - 1 is also not a power of two (e.g. 6, 7, 10..15 lanes), or
-/// the elements being vectorized are themselves vectors (REVEC).
-static bool isAllowedNonPowerOf2VF(unsigned NumElts, bool IsVectorElement) {
-  return VectorizeNonPowerOf2 && !has_single_bit(NumElts) &&
-         ((SLPReVec && IsVectorElement) ||
-          NumElts <= SmallProfitableNonPowerOf2 ||
-          !has_single_bit(NumElts - 1));
+/// supported non-power-of-2 width: \p NumElts + 1 must be a power of two
+/// (e.g. 3 or 7 lanes, i.e. almost a full power-of-2 register).
+static bool isAllowedNonPowerOf2VF(unsigned NumElts) {
+  return VectorizeNonPowerOf2 && has_single_bit(NumElts + 1);
 }
 
 /// Enables vectorization of copyable elements.
@@ -310,10 +300,10 @@ static const unsigned MaxPHINumOperands = 128;
 /// be inevitably scalarized.
 static bool isValidElementType(Type *Ty) {
   // TODO: Support ScalableVectorType.
-  if (SLPReVec && isVectorizedTy(Ty) && !getVectorizedTypeVF(Ty).isScalable())
-    Ty = toScalarizedTy(Ty);
-  return canVectorizeTy(Ty) && !Ty->isX86_FP80Ty() && !Ty->isPPC_FP128Ty() &&
-         !Ty->isVoidTy();
+  if (SLPReVec && isa<FixedVectorType>(Ty))
+    Ty = Ty->getScalarType();
+  return VectorType::isValidElementType(Ty) && !Ty->isX86_FP80Ty() &&
+         !Ty->isPPC_FP128Ty();
 }
 
 /// Returns the "element type" of the given value/instruction \p V.
@@ -338,33 +328,15 @@ static Type *getValueType(Value *V, bool LookThroughCmp = false) {
 static unsigned getNumElements(Type *Ty) {
   assert(!isa<ScalableVectorType>(Ty) &&
          "ScalableVectorType is not supported.");
-  if (isVectorizedTy(Ty))
-    return getVectorizedTypeVF(Ty).getFixedValue();
+  if (auto *VecTy = dyn_cast<FixedVectorType>(Ty))
+    return VecTy->getNumElements();
   return 1;
 }
 
 /// \returns the vector type of ScalarTy based on vectorization factor.
-static Type *getWidenedType(Type *ScalarTy, unsigned VF) {
-  if (VF == 1 && !isVectorizedTy(ScalarTy)) {
-    // Workaround for 1 x vector types: toVectorizedTy returns the type
-    // unchanged when EC is scalar, but BoUpSLP relies on widening to
-    // <1 x ScalarTy> (or struct of <1 x ElTy>) to keep the rest of the
-    // pipeline operating on vector types.
-    if (auto *StructTy = dyn_cast<StructType>(ScalarTy)) {
-      assert(isUnpackedStructLiteral(StructTy) &&
-             "expected unpacked struct literal");
-      assert(all_of(StructTy->elements(), VectorType::isValidElementType) &&
-             "expected all element types to be valid vector element types");
-      return StructType::get(
-          StructTy->getContext(),
-          map_to_vector(StructTy->elements(), [&](Type *ElTy) -> Type * {
-            return FixedVectorType::get(ElTy, 1);
-          }));
-    }
-    return FixedVectorType::get(ScalarTy, 1);
-  }
-  return toVectorizedTy(toScalarizedTy(ScalarTy),
-                        ElementCount::getFixed(VF * getNumElements(ScalarTy)));
+static FixedVectorType *getWidenedType(Type *ScalarTy, unsigned VF) {
+  return FixedVectorType::get(ScalarTy->getScalarType(),
+                              VF * getNumElements(ScalarTy));
 }
 
 /// Returns the number of elements of the given type \p Ty, not less than \p Sz,
@@ -372,7 +344,7 @@ static Type *getWidenedType(Type *ScalarTy, unsigned VF) {
 /// legalization.
 static unsigned getFullVectorNumberOfElements(const TargetTransformInfo &TTI,
                                               Type *Ty, unsigned Sz) {
-  if (!isValidElementType(Ty) || isa<StructType>(Ty))
+  if (!isValidElementType(Ty))
     return bit_ceil(Sz);
   // Find the number of elements, which forms full vectors.
   const unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
@@ -387,7 +359,7 @@ static unsigned getFullVectorNumberOfElements(const TargetTransformInfo &TTI,
 static unsigned
 getFloorFullVectorNumberOfElements(const TargetTransformInfo &TTI, Type *Ty,
                                    unsigned Sz) {
-  if (!isValidElementType(Ty) || isa<StructType>(Ty))
+  if (!isValidElementType(Ty))
     return bit_floor(Sz);
   // Find the number of elements, which forms full vectors.
   unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
@@ -2067,8 +2039,6 @@ static bool hasFullVectorsOrPowerOf2(const TargetTransformInfo &TTI, Type *Ty,
     return false;
   if (has_single_bit(Sz))
     return true;
-  if (isa<StructType>(Ty))
-    return false;
   const unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
   return NumParts > 0 && NumParts < Sz && has_single_bit(Sz / NumParts) &&
          Sz % NumParts == 0;
@@ -2078,20 +2048,19 @@ static bool hasFullVectorsOrPowerOf2(const TargetTransformInfo &TTI, Type *Ty,
 /// phase. If the type is going to be scalarized or does not uses whole
 /// registers, returns 1.
 static unsigned
-getNumberOfParts(const TargetTransformInfo &TTI, Type *VecTy, Type *ScalarTy,
+getNumberOfParts(const TargetTransformInfo &TTI, VectorType *VecTy,
+                 Type *ScalarTy,
                  const unsigned Limit = std::numeric_limits<unsigned>::max()) {
-  if (isa<StructType>(VecTy))
-    return 1;
   unsigned NumParts = TTI.getNumberOfParts(VecTy);
   if (NumParts == 0 || NumParts >= Limit)
     return 1;
   unsigned Sz = getNumElements(VecTy);
   unsigned ScalarSz = getNumElements(ScalarTy);
-  Type *ElementTy = toScalarizedTy(VecTy);
-  unsigned PWSz = getFullVectorNumberOfElements(TTI, ElementTy, Sz);
+  unsigned PWSz =
+      getFullVectorNumberOfElements(TTI, VecTy->getElementType(), Sz);
   if (NumParts >= Sz || PWSz % NumParts != 0 ||
       (PWSz / NumParts) % ScalarSz != 0 ||
-      !hasFullVectorsOrPowerOf2(TTI, ElementTy, PWSz / NumParts))
+      !hasFullVectorsOrPowerOf2(TTI, VecTy->getElementType(), PWSz / NumParts))
     return 1;
   const unsigned NumElts = PWSz / NumParts;
   if (divideCeil(Sz, NumElts) != NumParts)
@@ -2240,14 +2209,14 @@ class slpvectorizer::BoUpSLP {
         ReductionBitWidth >=
             DL->getTypeSizeInBits(
                 VectorizableTree.front()->Scalars.front()->getType()))
-      return cast<FixedVectorType>(
-          getWidenedType(VectorizableTree.front()->Scalars.front()->getType(),
-                         VectorizableTree.front()->getVectorFactor()));
-    return cast<FixedVectorType>(getWidenedType(
+      return getWidenedType(
+          VectorizableTree.front()->Scalars.front()->getType(),
+          VectorizableTree.front()->getVectorFactor());
+    return getWidenedType(
         IntegerType::get(
             VectorizableTree.front()->Scalars.front()->getContext(),
             ReductionBitWidth),
-        VectorizableTree.front()->getVectorFactor()));
+        VectorizableTree.front()->getVectorFactor());
   }
 
   /// Returns true if the tree results in one of the reduced bitcasts variants.
@@ -4020,7 +3989,8 @@ class slpvectorizer::BoUpSLP {
   /// scalar/slot type used to widen into \p VecTy/\p FinalVecTy and may itself
   /// be a FixedVectorType in ReVec mode or an adjusted type due to MinBWs.
   InstructionCost getVectorSpillReloadCost(const TreeEntry *E, Type *ScalarTy,
-                                           Type *VecTy, Type *FinalVecTy,
+                                           VectorType *VecTy,
+                                           VectorType *FinalVecTy,
                                            TTI::TargetCostKind CostKind) const;
 
   /// This is the recursive part of buildTree.
@@ -7137,12 +7107,12 @@ static InstructionCost getExtractWithExtendCost(
     const TargetTransformInfo &TTI, unsigned Opcode, Type *Dst,
     VectorType *VecTy, unsigned Index,
     TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) {
-  if (isVectorizedTy(Dst)) {
+  if (auto *ScalarTy = dyn_cast<FixedVectorType>(Dst)) {
     assert(SLPReVec && "Only supported by REVEC.");
-    auto *SubTp = cast<FixedVectorType>(
-        getWidenedType(toScalarizedTy(VecTy), getNumElements(Dst)));
+    auto *SubTp =
+        getWidenedType(VecTy->getElementType(), ScalarTy->getNumElements());
     return getShuffleCost(TTI, TTI::SK_ExtractSubvector, VecTy, {}, CostKind,
-                          Index * getNumElements(Dst), SubTp) +
+                          Index * ScalarTy->getNumElements(), SubTp) +
            TTI.getCastInstrCost(Opcode, Dst, SubTp, TTI::CastContextHint::None,
                                 CostKind);
   }
@@ -7235,7 +7205,7 @@ static bool isMaskedLoadCompress(
   InterleaveFactor = 0;
   Type *ScalarTy = VL.front()->getType();
   const size_t Sz = VL.size();
-  auto *VecTy = cast<VectorType>(getWidenedType(ScalarTy, Sz));
+  auto *VecTy = getWidenedType(ScalarTy, Sz);
   constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
   SmallVector<int> Mask;
   if (!Order.empty())
@@ -7271,7 +7241,7 @@ static bool isMaskedLoadCompress(
   // Check for very large distances between elements.
   if (*Diff / Sz >= MaxRegSize / 8)
     return false;
-  LoadVecTy = cast<FixedVectorType>(getWidenedType(ScalarTy, *Diff + 1));
+  LoadVecTy = getWidenedType(ScalarTy, *Diff + 1);
   auto *LI = cast<LoadInst>(Order.empty() ? VL.front() : VL[Order.front()]);
   Align CommonAlignment = LI->getAlign();
   IsMasked = !isSafeToLoadUnconditionally(
@@ -7320,8 +7290,8 @@ static bool isMaskedLoadCompress(
   }
   if (IsStrided && !IsMasked && Order.empty()) {
     // Check for potential segmented(interleaved) loads.
-    VectorType *AlignedLoadVecTy = cast<VectorType>(getWidenedType(
-        ScalarTy, getFullVectorNumberOfElements(TTI, ScalarTy, *Diff + 1)));
+    VectorType *AlignedLoadVecTy = getWidenedType(
+        ScalarTy, getFullVectorNumberOfElements(TTI, ScalarTy, *Diff + 1));
     if (!isSafeToLoadUnconditionally(Ptr0, AlignedLoadVecTy, CommonAlignment,
                                      DL, cast<LoadInst>(VL.back()), &AC, &DT,
                                      &TLI))
@@ -7512,7 +7482,7 @@ bool BoUpSLP::analyzeConstantStrideCandidate(
 
   Type *StrideTy = DL->getIndexType(Ptr0->getType());
   SPtrInfo.StrideVal = ConstantInt::getSigned(StrideTy, StrideIntVal);
-  SPtrInfo.Ty = cast<FixedVectorType>(getWidenedType(NewScalarTy, VecSz));
+  SPtrInfo.Ty = getWidenedType(NewScalarTy, VecSz);
   return true;
 }
 
@@ -7567,8 +7537,7 @@ bool BoUpSLP::analyzeRtStrideCandidate(ArrayRef<Value *> PointerOps,
     NewScalarTy = Type::getIntNTy(
         SE->getContext(),
         DL->getTypeSizeInBits(BaseTy).getFixedValue() * NumOffsets);
-  auto *StridedLoadTy =
-      cast<FixedVectorType>(getWidenedType(NewScalarTy, VecSz));
+  FixedVectorType *StridedLoadTy = getWidenedType(NewScalarTy, VecSz);
   unsigned MinProfitableStridedOps =
       IsLoad ? MinProfitableStridedLoads : MinProfitableStridedStores;
   const unsigned BaseTyNumElts = getNumElements(BaseTy);
@@ -7767,7 +7736,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
   // Check the order of pointer operands or that all pointers are the same.
   bool IsSorted = sortPtrAccesses(PointerOps, ScalarTy, *DL, *SE, Order);
 
-  auto *VecTy = cast<VectorType>(getWidenedType(ScalarTy, Sz));
+  auto *VecTy = getWidenedType(ScalarTy, Sz);
   Align CommonAlignment = computeCommonAlignment<LoadInst>(VL);
   // Cache masked gather legality - both the !IsSorted path below and the
   // post-branch check use the same VecTy/CommonAlignment, and the underlying
@@ -7848,7 +7817,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
     // estimate as a buildvector, otherwise estimate as splat.
     APInt DemandedElts = APInt::getAllOnes(Sz);
     Type *PtrScalarTy = PointerOps.front()->getType()->getScalarType();
-    auto *PtrVecTy = cast<VectorType>(getWidenedType(PtrScalarTy, Sz));
+    VectorType *PtrVecTy = getWidenedType(PtrScalarTy, Sz);
     // Cache the underlying object of PointerOps.front() - it is invariant
     // across the per-V comparisons below and getUnderlyingObject walks
     // GEP/cast chains.
@@ -7945,7 +7914,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
       }
       for (const auto &[SliceStart, LS] : States) {
         const unsigned SliceVF = std::min<unsigned>(VF, VL.size() - SliceStart);
-        auto *SubVecTy = cast<VectorType>(getWidenedType(ScalarTy, SliceVF));
+        auto *SubVecTy = getWidenedType(ScalarTy, SliceVF);
         auto *LI0 = cast<LoadInst>(VL[SliceStart]);
         InstructionCost VectorGEPCost =
             (LS == LoadsState::ScatterVectorize && ProfitableGatherPointers)
@@ -8550,8 +8519,7 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom,
       const auto *It = find_if_not(TE.Scalars, isConstant);
       if (It == TE.Scalars.begin())
         return OrdersType();
-      auto *Ty =
-          cast<VectorType>(getWidenedType(TE.Scalars.front()->getType(), Sz));
+      auto *Ty = getWidenedType(TE.Scalars.front()->getType(), Sz);
       if (It != TE.Scalars.end()) {
         OrdersType Order(Sz, Sz);
         unsigned Idx = std::distance(TE.Scalars.begin(), It);
@@ -8672,13 +8640,6 @@ bool BoUpSLP::isProfitableToReorder() const {
   constexpr unsigned TinyTree = 10;
   constexpr unsigned PhiOpsLimit = 12;
   constexpr unsigned GatherLoadsLimit = 2;
-  // Do not reorder splat stores.
-  if (VectorizableTree.size() == 2 &&
-      VectorizableTree.front()->State == TreeEntry::Vectorize &&
-      VectorizableTree.front()->getOpcode() == Instruction::Store &&
-      VectorizableTree.back()->Scalars.front() ==
-          VectorizableTree.back()->Scalars.back())
-    return false;
   if (VectorizableTree.size() <= TinyTree)
     return true;
   if (VectorizableTree.front()->hasState() &&
@@ -8816,12 +8777,6 @@ void BoUpSLP::reorderTopToBottom() {
   // Maps a TreeEntry to the reorder indices of external users.
   DenseMap<const TreeEntry *, SmallVector<OrdersType, 1>>
       ExternalUserReorderMap;
-  // TODO: Reordering of struct types is not supported.
-  if (any_of(VectorizableTree, [](const std::unique_ptr<TreeEntry> &TE) {
-        return TE->State == TreeEntry::Vectorize &&
-               isa<StructType>(getValueType(TE->Scalars.front()));
-      }))
-    return;
   // Compute IgnoreReorder once - it depends only on UserIgnoreList and
   // VectorizableTree.front(), which do not change during this loop.
   const bool IgnoreReorder =
@@ -8848,8 +8803,7 @@ void BoUpSLP::reorderTopToBottom() {
     if (TE->hasState() && TE->isAltShuffle() &&
         TE->State != TreeEntry::SplitVectorize) {
       Type *ScalarTy = TE->Scalars[0]->getType();
-      auto *VecTy =
-          cast<VectorType>(getWidenedType(ScalarTy, TE->Scalars.size()));
+      VectorType *VecTy = getWidenedType(ScalarTy, TE->Scalars.size());
       unsigned Opcode0 = TE->getOpcode();
       unsigned Opcode1 = TE->getAltOpcode();
       SmallBitVector OpcodeMask(
@@ -9218,10 +9172,6 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
     }
     if (Users.first) {
       auto &Data = Users;
-      // TODO: Reordering of struct types is not supported.
-      if (Data.first->State == TreeEntry::Vectorize &&
-          isa<StructType>(getValueType(Data.first->Scalars.front())))
-        continue;
       if (Data.first->State == TreeEntry::SplitVectorize) {
         assert(
             Data.second.size() <= 2 &&
@@ -10022,8 +9972,7 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     ArrayRef<Value *> Values(reinterpret_cast<Value *const *>(Loads.begin()),
                              Loads.size());
     Align Alignment = computeCommonAlignment<LoadInst>(Values);
-    auto *Ty = cast<VectorType>(
-        getWidenedType(Loads.front()->getType(), Loads.size()));
+    auto *Ty = getWidenedType(Loads.front()->getType(), Loads.size());
     return TTI->isLegalMaskedGather(Ty, Alignment) &&
            !TTI->forceScalarizeMaskedGather(Ty, Alignment);
   };
@@ -10035,13 +9984,8 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     SmallVector<std::pair<ArrayRef<Value *>, LoadsState>> Results;
     unsigned StartIdx = 0;
     SmallVector<int> CandidateVFs;
-    if (isAllowedNonPowerOf2VF(
-            MaxVF, isa<FixedVectorType>(Loads.front()->getType()))) {
-      const unsigned FullVectorNumElements = getFullVectorNumberOfElements(
-          *TTI, Loads.front()->getType(), MaxVF - 1);
-      if (MaxVF >= SmallestNonPowerOf2 && FullVectorNumElements != MaxVF - 1)
-        CandidateVFs.push_back(MaxVF);
-    }
+    if (isAllowedNonPowerOf2VF(MaxVF))
+      CandidateVFs.push_back(MaxVF);
     for (int NumElts = getFloorFullVectorNumberOfElements(
              *TTI, Loads.front()->getType(), MaxVF);
          NumElts > 1; NumElts = getFloorFullVectorNumberOfElements(
@@ -10326,8 +10270,7 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
                 // Segmented load detected - vectorize at maximum vector factor.
                 if (InterleaveFactor <= Slice.size() &&
                     TTI.isLegalInterleavedAccessType(
-                        cast<VectorType>(
-                            getWidenedType(Slice.front()->getType(), VF)),
+                        getWidenedType(Slice.front()->getType(), VF),
                         InterleaveFactor,
                         cast<LoadInst>(Slice.front())->getAlign(),
                         cast<LoadInst>(Slice.front())
@@ -10587,10 +10530,11 @@ buildIntrinsicArgTypes(const CallInst *CI, const Intrinsic::ID ID,
 /// function (if possible) calls. Returns invalid cost for the corresponding
 /// calls, if they cannot be vectorized/will be scalarized.
 static std::pair<InstructionCost, InstructionCost>
-getVectorCallCosts(CallInst *CI, Type *VecTy, TargetTransformInfo *TTI,
-                   TargetLibraryInfo *TLI, ArrayRef<Type *> ArgTys) {
+getVectorCallCosts(CallInst *CI, FixedVectorType *VecTy,
+                   TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
+                   ArrayRef<Type *> ArgTys) {
   auto Shape = VFShape::get(CI->getFunctionType(),
-                            ElementCount::getFixed(getNumElements(VecTy)),
+                            ElementCount::getFixed(Ve...
[truncated]
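The restored `getWidenedType` in the diff above always returns a `FixedVectorType`, which is why the revert can drop the `cast<FixedVectorType>` and `cast<VectorType>` wrappers at many call sites; it has no way to express the struct-of-vectors types the reverted change produced. A toy model of the restored arithmetic, including the REVEC case where the scalar is itself a vector. `ToyTy` and the `toy*` functions are hypothetical: `ToyTy` encodes only an element name and a lane count, not a real `llvm::Type`.

```cpp
#include <string>

// Toy stand-in for an LLVM type (hypothetical encoding): an element type
// name plus a fixed lane count, where Lanes == 0 means a plain scalar.
struct ToyTy {
  std::string Elt;
  unsigned Lanes = 0;
};

// Mirrors the restored getNumElements: lane count of a fixed vector, else 1.
unsigned toyGetNumElements(const ToyTy &Ty) {
  return Ty.Lanes != 0 ? Ty.Lanes : 1;
}

// Mirrors the restored getWidenedType: a fixed vector of the scalar element
// type with VF * getNumElements(ScalarTy) lanes, even when VF == 1.
ToyTy toyGetWidenedType(const ToyTy &ScalarTy, unsigned VF) {
  return ToyTy{ScalarTy.Elt, VF * toyGetNumElements(ScalarTy)};
}

// Prints the toy type in LLVM IR syntax, e.g. "float" or "<8 x float>".
std::string toyPrint(const ToyTy &Ty) {
  if (Ty.Lanes == 0)
    return Ty.Elt;
  return "<" + std::to_string(Ty.Lanes) + " x " + Ty.Elt + ">";
}
```

For example, widening `<2 x float>` by a VF of 4 yields `<8 x float>`, and a VF of 1 still yields a one-lane vector such as `<1 x i32>`, the case the reverted code had to special-case with its `<1 x ScalarTy>` workaround.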

  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduced-val-vectorized-in-transform.ll (+6-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-bool-logic-op-inside.ll (+9-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+23-34)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-same-vals.ll (+9-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reused-extract-scalar-lanes.ll (+3-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/revec-non-power-2-to-power-2-large-vect.ll (+3-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/rgb_phi.ll (+20-14)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scalarize-ctlz.ll (+29-48)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll (+8-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-copyable-cmp-poison.ll (+6-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-reduction-op.ll (+8-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/trunced-buildvector-scalar-extended.ll (+5-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec3-base.ll (+12-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll (+4-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/bool-logical-op-reduction-with-poison.ll (+16-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/logical-ops-poisonous-repeated.ll (+9-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/partial-register-extract.ll (+22-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/reduced-gathered-vectorized.ll (+8-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/sincos.ll (+32-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/struct-return-revec.ll (+16-12)
diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
index 23a79df7b2cee..8f512f0fc3ee8 100644
--- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
@@ -57,9 +57,9 @@ class BoUpSLP;
 
 struct SLPVectorizerPass : public OptionalPassInfoMixin<SLPVectorizerPass> {
   using StoreList = SmallVector<StoreInst *, 8>;
-  using StoreListMap = SmallMapVector<Value *, StoreList, 8>;
+  using StoreListMap = MapVector<Value *, StoreList>;
   using GEPList = SmallVector<GetElementPtrInst *, 8>;
-  using GEPListMap = SmallMapVector<Value *, GEPList, 8>;
+  using GEPListMap = MapVector<Value *, GEPList>;
   using InstSetVector = SmallSetVector<Instruction *, 8>;
 
   ScalarEvolution *SE = nullptr;
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 898115005a7dd..3ec332b93caa9 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -28,7 +28,6 @@
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallString.h"
-#include "llvm/ADT/SmallVectorExtras.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/ADT/iterator_range.h"
@@ -72,7 +71,6 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
-#include "llvm/IR/VectorTypeUtils.h"
 #ifdef EXPENSIVE_CHECKS
 #include "llvm/IR/Verifier.h"
 #endif
@@ -231,7 +229,7 @@ static cl::opt<bool>
                 cl::desc("Display the SLP trees with Graphviz"));
 
 static cl::opt<bool> VectorizeNonPowerOf2(
-    "slp-vectorize-non-power-of-2", cl::init(true), cl::Hidden,
+    "slp-vectorize-non-power-of-2", cl::init(false), cl::Hidden,
     cl::desc("Try to vectorize with non-power-of-2 number of elements."));
 
 static cl::opt<bool> ForcePostProcessStoresOperands(
@@ -243,19 +241,11 @@ static cl::opt<bool> NonVectReductions(
     cl::desc(
         "Use  non-vectorizable instructions as potential reduction roots."));
 
-static constexpr unsigned SmallProfitableNonPowerOf2 = 5;
-static constexpr unsigned SmallestNonPowerOf2 = 3;
-
 /// True when \p slp-vectorize-non-power-of-2 is enabled and \p NumElts is a
-/// supported non-power-of-2 width. The width is supported if \p NumElts is not
-/// a power of two and either it is small (<= 5, e.g. 3 or 5 lanes), or
-/// \p NumElts - 1 is also not a power of two (e.g. 6, 7, 10..15 lanes), or
-/// the elements being vectorized are themselves vectors (REVEC).
-static bool isAllowedNonPowerOf2VF(unsigned NumElts, bool IsVectorElement) {
-  return VectorizeNonPowerOf2 && !has_single_bit(NumElts) &&
-         ((SLPReVec && IsVectorElement) ||
-          NumElts <= SmallProfitableNonPowerOf2 ||
-          !has_single_bit(NumElts - 1));
+/// supported non-power-of-2 width: \p NumElts + 1 must be a power of two
+/// (e.g. 3 or 7 lanes, i.e. almost a full power-of-2 register).
+static bool isAllowedNonPowerOf2VF(unsigned NumElts) {
+  return VectorizeNonPowerOf2 && has_single_bit(NumElts + 1);
 }
 
 /// Enables vectorization of copyable elements.
@@ -310,10 +300,10 @@ static const unsigned MaxPHINumOperands = 128;
 /// be inevitably scalarized.
 static bool isValidElementType(Type *Ty) {
   // TODO: Support ScalableVectorType.
-  if (SLPReVec && isVectorizedTy(Ty) && !getVectorizedTypeVF(Ty).isScalable())
-    Ty = toScalarizedTy(Ty);
-  return canVectorizeTy(Ty) && !Ty->isX86_FP80Ty() && !Ty->isPPC_FP128Ty() &&
-         !Ty->isVoidTy();
+  if (SLPReVec && isa<FixedVectorType>(Ty))
+    Ty = Ty->getScalarType();
+  return VectorType::isValidElementType(Ty) && !Ty->isX86_FP80Ty() &&
+         !Ty->isPPC_FP128Ty();
 }
 
 /// Returns the "element type" of the given value/instruction \p V.
@@ -338,33 +328,15 @@ static Type *getValueType(Value *V, bool LookThroughCmp = false) {
 static unsigned getNumElements(Type *Ty) {
   assert(!isa<ScalableVectorType>(Ty) &&
          "ScalableVectorType is not supported.");
-  if (isVectorizedTy(Ty))
-    return getVectorizedTypeVF(Ty).getFixedValue();
+  if (auto *VecTy = dyn_cast<FixedVectorType>(Ty))
+    return VecTy->getNumElements();
   return 1;
 }
 
 /// \returns the vector type of ScalarTy based on vectorization factor.
-static Type *getWidenedType(Type *ScalarTy, unsigned VF) {
-  if (VF == 1 && !isVectorizedTy(ScalarTy)) {
-    // Workaround for 1 x vector types: toVectorizedTy returns the type
-    // unchanged when EC is scalar, but BoUpSLP relies on widening to
-    // <1 x ScalarTy> (or struct of <1 x ElTy>) to keep the rest of the
-    // pipeline operating on vector types.
-    if (auto *StructTy = dyn_cast<StructType>(ScalarTy)) {
-      assert(isUnpackedStructLiteral(StructTy) &&
-             "expected unpacked struct literal");
-      assert(all_of(StructTy->elements(), VectorType::isValidElementType) &&
-             "expected all element types to be valid vector element types");
-      return StructType::get(
-          StructTy->getContext(),
-          map_to_vector(StructTy->elements(), [&](Type *ElTy) -> Type * {
-            return FixedVectorType::get(ElTy, 1);
-          }));
-    }
-    return FixedVectorType::get(ScalarTy, 1);
-  }
-  return toVectorizedTy(toScalarizedTy(ScalarTy),
-                        ElementCount::getFixed(VF * getNumElements(ScalarTy)));
+static FixedVectorType *getWidenedType(Type *ScalarTy, unsigned VF) {
+  return FixedVectorType::get(ScalarTy->getScalarType(),
+                              VF * getNumElements(ScalarTy));
 }
 
 /// Returns the number of elements of the given type \p Ty, not less than \p Sz,
@@ -372,7 +344,7 @@ static Type *getWidenedType(Type *ScalarTy, unsigned VF) {
 /// legalization.
 static unsigned getFullVectorNumberOfElements(const TargetTransformInfo &TTI,
                                               Type *Ty, unsigned Sz) {
-  if (!isValidElementType(Ty) || isa<StructType>(Ty))
+  if (!isValidElementType(Ty))
     return bit_ceil(Sz);
   // Find the number of elements, which forms full vectors.
   const unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
@@ -387,7 +359,7 @@ static unsigned getFullVectorNumberOfElements(const TargetTransformInfo &TTI,
 static unsigned
 getFloorFullVectorNumberOfElements(const TargetTransformInfo &TTI, Type *Ty,
                                    unsigned Sz) {
-  if (!isValidElementType(Ty) || isa<StructType>(Ty))
+  if (!isValidElementType(Ty))
     return bit_floor(Sz);
   // Find the number of elements, which forms full vectors.
   unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
@@ -2067,8 +2039,6 @@ static bool hasFullVectorsOrPowerOf2(const TargetTransformInfo &TTI, Type *Ty,
     return false;
   if (has_single_bit(Sz))
     return true;
-  if (isa<StructType>(Ty))
-    return false;
   const unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
   return NumParts > 0 && NumParts < Sz && has_single_bit(Sz / NumParts) &&
          Sz % NumParts == 0;
@@ -2078,20 +2048,19 @@ static bool hasFullVectorsOrPowerOf2(const TargetTransformInfo &TTI, Type *Ty,
 /// phase. If the type is going to be scalarized or does not uses whole
 /// registers, returns 1.
 static unsigned
-getNumberOfParts(const TargetTransformInfo &TTI, Type *VecTy, Type *ScalarTy,
+getNumberOfParts(const TargetTransformInfo &TTI, VectorType *VecTy,
+                 Type *ScalarTy,
                  const unsigned Limit = std::numeric_limits<unsigned>::max()) {
-  if (isa<StructType>(VecTy))
-    return 1;
   unsigned NumParts = TTI.getNumberOfParts(VecTy);
   if (NumParts == 0 || NumParts >= Limit)
     return 1;
   unsigned Sz = getNumElements(VecTy);
   unsigned ScalarSz = getNumElements(ScalarTy);
-  Type *ElementTy = toScalarizedTy(VecTy);
-  unsigned PWSz = getFullVectorNumberOfElements(TTI, ElementTy, Sz);
+  unsigned PWSz =
+      getFullVectorNumberOfElements(TTI, VecTy->getElementType(), Sz);
   if (NumParts >= Sz || PWSz % NumParts != 0 ||
       (PWSz / NumParts) % ScalarSz != 0 ||
-      !hasFullVectorsOrPowerOf2(TTI, ElementTy, PWSz / NumParts))
+      !hasFullVectorsOrPowerOf2(TTI, VecTy->getElementType(), PWSz / NumParts))
     return 1;
   const unsigned NumElts = PWSz / NumParts;
   if (divideCeil(Sz, NumElts) != NumParts)
@@ -2240,14 +2209,14 @@ class slpvectorizer::BoUpSLP {
         ReductionBitWidth >=
             DL->getTypeSizeInBits(
                 VectorizableTree.front()->Scalars.front()->getType()))
-      return cast<FixedVectorType>(
-          getWidenedType(VectorizableTree.front()->Scalars.front()->getType(),
-                         VectorizableTree.front()->getVectorFactor()));
-    return cast<FixedVectorType>(getWidenedType(
+      return getWidenedType(
+          VectorizableTree.front()->Scalars.front()->getType(),
+          VectorizableTree.front()->getVectorFactor());
+    return getWidenedType(
         IntegerType::get(
             VectorizableTree.front()->Scalars.front()->getContext(),
             ReductionBitWidth),
-        VectorizableTree.front()->getVectorFactor()));
+        VectorizableTree.front()->getVectorFactor());
   }
 
   /// Returns true if the tree results in one of the reduced bitcasts variants.
@@ -4020,7 +3989,8 @@ class slpvectorizer::BoUpSLP {
   /// scalar/slot type used to widen into \p VecTy/\p FinalVecTy and may itself
   /// be a FixedVectorType in ReVec mode or an adjusted type due to MinBWs.
   InstructionCost getVectorSpillReloadCost(const TreeEntry *E, Type *ScalarTy,
-                                           Type *VecTy, Type *FinalVecTy,
+                                           VectorType *VecTy,
+                                           VectorType *FinalVecTy,
                                            TTI::TargetCostKind CostKind) const;
 
   /// This is the recursive part of buildTree.
@@ -7137,12 +7107,12 @@ static InstructionCost getExtractWithExtendCost(
     const TargetTransformInfo &TTI, unsigned Opcode, Type *Dst,
     VectorType *VecTy, unsigned Index,
     TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) {
-  if (isVectorizedTy(Dst)) {
+  if (auto *ScalarTy = dyn_cast<FixedVectorType>(Dst)) {
     assert(SLPReVec && "Only supported by REVEC.");
-    auto *SubTp = cast<FixedVectorType>(
-        getWidenedType(toScalarizedTy(VecTy), getNumElements(Dst)));
+    auto *SubTp =
+        getWidenedType(VecTy->getElementType(), ScalarTy->getNumElements());
     return getShuffleCost(TTI, TTI::SK_ExtractSubvector, VecTy, {}, CostKind,
-                          Index * getNumElements(Dst), SubTp) +
+                          Index * ScalarTy->getNumElements(), SubTp) +
            TTI.getCastInstrCost(Opcode, Dst, SubTp, TTI::CastContextHint::None,
                                 CostKind);
   }
@@ -7235,7 +7205,7 @@ static bool isMaskedLoadCompress(
   InterleaveFactor = 0;
   Type *ScalarTy = VL.front()->getType();
   const size_t Sz = VL.size();
-  auto *VecTy = cast<VectorType>(getWidenedType(ScalarTy, Sz));
+  auto *VecTy = getWidenedType(ScalarTy, Sz);
   constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
   SmallVector<int> Mask;
   if (!Order.empty())
@@ -7271,7 +7241,7 @@ static bool isMaskedLoadCompress(
   // Check for very large distances between elements.
   if (*Diff / Sz >= MaxRegSize / 8)
     return false;
-  LoadVecTy = cast<FixedVectorType>(getWidenedType(ScalarTy, *Diff + 1));
+  LoadVecTy = getWidenedType(ScalarTy, *Diff + 1);
   auto *LI = cast<LoadInst>(Order.empty() ? VL.front() : VL[Order.front()]);
   Align CommonAlignment = LI->getAlign();
   IsMasked = !isSafeToLoadUnconditionally(
@@ -7320,8 +7290,8 @@ static bool isMaskedLoadCompress(
   }
   if (IsStrided && !IsMasked && Order.empty()) {
     // Check for potential segmented(interleaved) loads.
-    VectorType *AlignedLoadVecTy = cast<VectorType>(getWidenedType(
-        ScalarTy, getFullVectorNumberOfElements(TTI, ScalarTy, *Diff + 1)));
+    VectorType *AlignedLoadVecTy = getWidenedType(
+        ScalarTy, getFullVectorNumberOfElements(TTI, ScalarTy, *Diff + 1));
     if (!isSafeToLoadUnconditionally(Ptr0, AlignedLoadVecTy, CommonAlignment,
                                      DL, cast<LoadInst>(VL.back()), &AC, &DT,
                                      &TLI))
@@ -7512,7 +7482,7 @@ bool BoUpSLP::analyzeConstantStrideCandidate(
 
   Type *StrideTy = DL->getIndexType(Ptr0->getType());
   SPtrInfo.StrideVal = ConstantInt::getSigned(StrideTy, StrideIntVal);
-  SPtrInfo.Ty = cast<FixedVectorType>(getWidenedType(NewScalarTy, VecSz));
+  SPtrInfo.Ty = getWidenedType(NewScalarTy, VecSz);
   return true;
 }
 
@@ -7567,8 +7537,7 @@ bool BoUpSLP::analyzeRtStrideCandidate(ArrayRef<Value *> PointerOps,
     NewScalarTy = Type::getIntNTy(
         SE->getContext(),
         DL->getTypeSizeInBits(BaseTy).getFixedValue() * NumOffsets);
-  auto *StridedLoadTy =
-      cast<FixedVectorType>(getWidenedType(NewScalarTy, VecSz));
+  FixedVectorType *StridedLoadTy = getWidenedType(NewScalarTy, VecSz);
   unsigned MinProfitableStridedOps =
       IsLoad ? MinProfitableStridedLoads : MinProfitableStridedStores;
   const unsigned BaseTyNumElts = getNumElements(BaseTy);
@@ -7767,7 +7736,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
   // Check the order of pointer operands or that all pointers are the same.
   bool IsSorted = sortPtrAccesses(PointerOps, ScalarTy, *DL, *SE, Order);
 
-  auto *VecTy = cast<VectorType>(getWidenedType(ScalarTy, Sz));
+  auto *VecTy = getWidenedType(ScalarTy, Sz);
   Align CommonAlignment = computeCommonAlignment<LoadInst>(VL);
   // Cache masked gather legality - both the !IsSorted path below and the
   // post-branch check use the same VecTy/CommonAlignment, and the underlying
@@ -7848,7 +7817,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
     // estimate as a buildvector, otherwise estimate as splat.
     APInt DemandedElts = APInt::getAllOnes(Sz);
     Type *PtrScalarTy = PointerOps.front()->getType()->getScalarType();
-    auto *PtrVecTy = cast<VectorType>(getWidenedType(PtrScalarTy, Sz));
+    VectorType *PtrVecTy = getWidenedType(PtrScalarTy, Sz);
     // Cache the underlying object of PointerOps.front() - it is invariant
     // across the per-V comparisons below and getUnderlyingObject walks
     // GEP/cast chains.
@@ -7945,7 +7914,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
       }
       for (const auto &[SliceStart, LS] : States) {
         const unsigned SliceVF = std::min<unsigned>(VF, VL.size() - SliceStart);
-        auto *SubVecTy = cast<VectorType>(getWidenedType(ScalarTy, SliceVF));
+        auto *SubVecTy = getWidenedType(ScalarTy, SliceVF);
         auto *LI0 = cast<LoadInst>(VL[SliceStart]);
         InstructionCost VectorGEPCost =
             (LS == LoadsState::ScatterVectorize && ProfitableGatherPointers)
@@ -8550,8 +8519,7 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom,
       const auto *It = find_if_not(TE.Scalars, isConstant);
       if (It == TE.Scalars.begin())
         return OrdersType();
-      auto *Ty =
-          cast<VectorType>(getWidenedType(TE.Scalars.front()->getType(), Sz));
+      auto *Ty = getWidenedType(TE.Scalars.front()->getType(), Sz);
       if (It != TE.Scalars.end()) {
         OrdersType Order(Sz, Sz);
         unsigned Idx = std::distance(TE.Scalars.begin(), It);
@@ -8672,13 +8640,6 @@ bool BoUpSLP::isProfitableToReorder() const {
   constexpr unsigned TinyTree = 10;
   constexpr unsigned PhiOpsLimit = 12;
   constexpr unsigned GatherLoadsLimit = 2;
-  // Do not reorder splat stores.
-  if (VectorizableTree.size() == 2 &&
-      VectorizableTree.front()->State == TreeEntry::Vectorize &&
-      VectorizableTree.front()->getOpcode() == Instruction::Store &&
-      VectorizableTree.back()->Scalars.front() ==
-          VectorizableTree.back()->Scalars.back())
-    return false;
   if (VectorizableTree.size() <= TinyTree)
     return true;
   if (VectorizableTree.front()->hasState() &&
@@ -8816,12 +8777,6 @@ void BoUpSLP::reorderTopToBottom() {
   // Maps a TreeEntry to the reorder indices of external users.
   DenseMap<const TreeEntry *, SmallVector<OrdersType, 1>>
       ExternalUserReorderMap;
-  // TODO: Reordering of struct types is not supported.
-  if (any_of(VectorizableTree, [](const std::unique_ptr<TreeEntry> &TE) {
-        return TE->State == TreeEntry::Vectorize &&
-               isa<StructType>(getValueType(TE->Scalars.front()));
-      }))
-    return;
   // Compute IgnoreReorder once - it depends only on UserIgnoreList and
   // VectorizableTree.front(), which do not change during this loop.
   const bool IgnoreReorder =
@@ -8848,8 +8803,7 @@ void BoUpSLP::reorderTopToBottom() {
     if (TE->hasState() && TE->isAltShuffle() &&
         TE->State != TreeEntry::SplitVectorize) {
       Type *ScalarTy = TE->Scalars[0]->getType();
-      auto *VecTy =
-          cast<VectorType>(getWidenedType(ScalarTy, TE->Scalars.size()));
+      VectorType *VecTy = getWidenedType(ScalarTy, TE->Scalars.size());
       unsigned Opcode0 = TE->getOpcode();
       unsigned Opcode1 = TE->getAltOpcode();
       SmallBitVector OpcodeMask(
@@ -9218,10 +9172,6 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
     }
     if (Users.first) {
       auto &Data = Users;
-      // TODO: Reordering of struct types is not supported.
-      if (Data.first->State == TreeEntry::Vectorize &&
-          isa<StructType>(getValueType(Data.first->Scalars.front())))
-        continue;
       if (Data.first->State == TreeEntry::SplitVectorize) {
         assert(
             Data.second.size() <= 2 &&
@@ -10022,8 +9972,7 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     ArrayRef<Value *> Values(reinterpret_cast<Value *const *>(Loads.begin()),
                              Loads.size());
     Align Alignment = computeCommonAlignment<LoadInst>(Values);
-    auto *Ty = cast<VectorType>(
-        getWidenedType(Loads.front()->getType(), Loads.size()));
+    auto *Ty = getWidenedType(Loads.front()->getType(), Loads.size());
     return TTI->isLegalMaskedGather(Ty, Alignment) &&
            !TTI->forceScalarizeMaskedGather(Ty, Alignment);
   };
@@ -10035,13 +9984,8 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     SmallVector<std::pair<ArrayRef<Value *>, LoadsState>> Results;
     unsigned StartIdx = 0;
     SmallVector<int> CandidateVFs;
-    if (isAllowedNonPowerOf2VF(
-            MaxVF, isa<FixedVectorType>(Loads.front()->getType()))) {
-      const unsigned FullVectorNumElements = getFullVectorNumberOfElements(
-          *TTI, Loads.front()->getType(), MaxVF - 1);
-      if (MaxVF >= SmallestNonPowerOf2 && FullVectorNumElements != MaxVF - 1)
-        CandidateVFs.push_back(MaxVF);
-    }
+    if (isAllowedNonPowerOf2VF(MaxVF))
+      CandidateVFs.push_back(MaxVF);
     for (int NumElts = getFloorFullVectorNumberOfElements(
              *TTI, Loads.front()->getType(), MaxVF);
          NumElts > 1; NumElts = getFloorFullVectorNumberOfElements(
@@ -10326,8 +10270,7 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
                 // Segmented load detected - vectorize at maximum vector factor.
                 if (InterleaveFactor <= Slice.size() &&
                     TTI.isLegalInterleavedAccessType(
-                        cast<VectorType>(
-                            getWidenedType(Slice.front()->getType(), VF)),
+                        getWidenedType(Slice.front()->getType(), VF),
                         InterleaveFactor,
                         cast<LoadInst>(Slice.front())->getAlign(),
                         cast<LoadInst>(Slice.front())
@@ -10587,10 +10530,11 @@ buildIntrinsicArgTypes(const CallInst *CI, const Intrinsic::ID ID,
 /// function (if possible) calls. Returns invalid cost for the corresponding
 /// calls, if they cannot be vectorized/will be scalarized.
 static std::pair<InstructionCost, InstructionCost>
-getVectorCallCosts(CallInst *CI, Type *VecTy, TargetTransformInfo *TTI,
-                   TargetLibraryInfo *TLI, ArrayRef<Type *> ArgTys) {
+getVectorCallCosts(CallInst *CI, FixedVectorType *VecTy,
+                   TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
+                   ArrayRef<Type *> ArgTys) {
   auto Shape = VFShape::get(CI->getFunctionType(),
-                            ElementCount::getFixed(getNumElements(VecTy)),
+                            ElementCount::getFixed(Ve...
[truncated]
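Most of the `SLPVectorizer.cpp` changes shown above follow from `getWidenedType`. After the revert it always returns `FixedVectorType::get(ScalarTy->getScalarType(), VF * getNumElements(ScalarTy))`, so it can be typed as `FixedVectorType *` and the `cast<FixedVectorType>`/`cast<VectorType>` wrappers at its call sites disappear. The removed version could instead return a struct of vectors for literal-struct inputs. The sketch below models the two shapes with a hypothetical toy type; `ToyTy`, `widenFlat`, `widenStruct` and `str` are illustrative names, not LLVM APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy type (not llvm::Type): element name plus lane count.
// Lanes == 0 means a plain scalar; Lanes > 0 means <Lanes x Elt>.
struct ToyTy {
  std::string Elt;
  unsigned Lanes;
};

// getNumElements analogue: vectors report their lane count, scalars 1.
unsigned numElements(const ToyTy &T) { return T.Lanes ? T.Lanes : 1; }

// Restored behaviour: always a flat vector of the scalar element type, so a
// REVEC input <2 x float> widened by VF = 4 becomes <8 x float>.
ToyTy widenFlat(const ToyTy &T, unsigned VF) {
  return {T.Elt, VF * numElements(T)};
}

// Removed behaviour for literal structs: widen each field separately, e.g.
// {float, float} with VF = 4 -> {<4 x float>, <4 x float>}. The result is a
// struct of vectors, not a vector type.
std::vector<ToyTy> widenStruct(const std::vector<ToyTy> &Fields, unsigned VF) {
  std::vector<ToyTy> Out;
  for (const ToyTy &F : Fields)
    Out.push_back(widenFlat(F, VF));
  return Out;
}

// Render a toy type in LLVM-like syntax.
std::string str(const ToyTy &T) {
  return T.Lanes ? "<" + std::to_string(T.Lanes) + " x " + T.Elt + ">" : T.Elt;
}

// Usage: scalar, REVEC and struct inputs widened by VF = 4.
void demoWidening() {
  assert(str(widenFlat(ToyTy{"float", 0}, 4)) == "<4 x float>");
  assert(str(widenFlat(ToyTy{"float", 2}, 4)) == "<8 x float>"); // REVEC
  std::vector<ToyTy> S =
      widenStruct({ToyTy{"float", 0}, ToyTy{"float", 0}}, 4);
  assert(S.size() == 2 && str(S[0]) == "<4 x float>");
}
```

Because every widened type is a vector again, callers such as `getNumberOfParts` and `getVectorCallCosts` can take `VectorType *`/`FixedVectorType *` directly instead of a generic `Type *`, which is the signature change visible in the diff.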

@llvmorg-github-actions


@llvm/pr-subscribers-vectorizers

Author: Hans Wennborg (zmodem)

Changes

It causes assertion failures such as this one. See the discussion on the PR.

Constants.cpp:2802:
static Constant *llvm::ConstantExpr::getInsertElement(Constant *, Constant *, Constant *, Type *): Assertion `Val->getType()->isVectorTy() &&
"Tried to create insertelement operation on non-vector type!"' failed.

> Allow SLP to combine across lanes calls that return a literal struct
> (llvm.sincos, llvm.*.with.overflow, llvm.frexp, ...) into a single
> call returning a struct of vectors, by widening {T, T, ...} to
> {<VF x T>, ...} via VectorTypeUtils and emitting extractvalue +
> extractelement for external uses.
>
> Original Pull Request: #195521
>
> Original Pull Request2: #196756
>
> Recommit after revert #197969
>
> Added check for valid vectorizable type.
>
> Reviewers:
>
> Pull Request: #197994
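To illustrate the widening described in the quoted commit message, here is a C++ model of the data layout for an `llvm.sincos`-style call. It is a sketch under stated assumptions, not IR: `SinCos`, `SinCosVec`, `sincosVec` and `extractCos` are hypothetical names, and `std::sin`/`std::cos` stand in for the intrinsic. Four scalar calls each returning `{float, float}` become one call returning `{<4 x float>, <4 x float>}`, and a scalar user of one lane is served by selecting the struct field first and the lane second.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar form: one call per lane, each returning a literal struct {T, T}
// (modelled on llvm.sincos returning {sin, cos}).
struct SinCos {
  float Sin, Cos;
};

// Vectorized form: one call returning a struct of vectors {<VF x T>, <VF x T>}.
template <std::size_t VF> struct SinCosVec {
  std::array<float, VF> Sin, Cos;
};

// Hypothetical "vector sincos": computes all lanes in one call.
template <std::size_t VF>
SinCosVec<VF> sincosVec(const std::array<float, VF> &X) {
  SinCosVec<VF> R;
  for (std::size_t I = 0; I < VF; ++I) {
    R.Sin[I] = std::sin(X[I]);
    R.Cos[I] = std::cos(X[I]);
  }
  return R;
}

// A scalar (external) user of lane I's cos result: select the struct field
// first (like extractvalue %r, 1), then the lane (like extractelement).
template <std::size_t VF>
float extractCos(const SinCosVec<VF> &V, std::size_t Lane) {
  const std::array<float, VF> &Field = V.Cos; // ~ extractvalue %r, 1
  return Field[Lane];                         // ~ extractelement %field, Lane
}

// Usage: four lanes of sincos(0) collapse into one call; cos(0) == 1.
void demoSinCos() {
  SinCosVec<4> R = sincosVec<4>({0.0f, 0.0f, 0.0f, 0.0f});
  assert(extractCos(R, 2) == 1.0f);
}
```

The widened value is a struct, not a vector type, so a lane can only be reached after the field has been selected; a struct-typed value cannot itself be the operand of `insertelement` or `extractelement`.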

This reverts commit 1c5e395
and the follow-up or dependent commits landed since:

aa2f124 [SLP] Enable full non-power-of-2 vectorization by default
6e8b6ef [SLP][REVEC] Fix crash on scalable vector types with -slp-revec
8156fce [SLP] Prefer VF-matching scalar-set match in gather-shuffle lookup
97ce93a [SLP]Consider non-profitable trees with buildvector of struct-returning instructions
f0adfab [SLP] Preserve profitable trees when subtree trimming would reduce to buildvector-only


Patch is 884.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/198265.diff

61 Files Affected:

  • (modified) llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+204-604)
  • (modified) llvm/test/CodeGen/WebAssembly/slp-memory-interleave.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll (+98-31)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll (+5-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll (+5-2)
  • (removed) llvm/test/Transforms/SLPVectorizer/AArch64/scalable-type-revec.ll (-22)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/scalable-vector.ll (+1-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/trunc-insertion.ll (+11-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s352.ll (+11-18)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+21-6)
  • (removed) llvm/test/Transforms/SLPVectorizer/AMDGPU/transform-node-gather-struct.ll (-49)
  • (removed) llvm/test/Transforms/SLPVectorizer/RISCV/complex-nonvect-struct-returned.ll (-22)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/horizontal-list.ll (+41-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reduced-value-repeated-and-vectorized.ll (+7-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/revec.ll (+60-24)
  • (modified) llvm/test/Transforms/SLPVectorizer/SystemZ/minbitwidth-trunc.ll (+9-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll (+10-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll (+10-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-add-saddo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-add-uaddo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-fp-inseltpoison.ll (+32-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-fp.ll (+32-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-mul-smulo.ll (+615-549)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-mul-umulo.ll (+615-429)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-sub-ssubo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/arith-sub-usubo.ll (+615-449)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/bool-mask.ll (+11-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/buildvector-store-chains.ll (+7-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/compare-reduce.ll (+6-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/deleted-instructions-clear.ll (+10-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll (+42-137)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/entries-shuffled-diff-sizes.ll (+7-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/extracts-non-extendable.ll (+2-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/gather-extractelements-different-bbs.ll (+8-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll (+56-44)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll (+76-140)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/multi-use-bitcasted-reduction.ll (+6-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/multi_user.ll (+11-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll (+11-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/parent-node-split-non-schedulable.ll (+7-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduced-val-vectorized-in-transform.ll (+6-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-bool-logic-op-inside.ll (+9-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+23-34)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-same-vals.ll (+9-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reused-extract-scalar-lanes.ll (+3-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/revec-non-power-2-to-power-2-large-vect.ll (+3-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/rgb_phi.ll (+20-14)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scalarize-ctlz.ll (+29-48)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll (+8-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-copyable-cmp-poison.ll (+6-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/select-reduction-op.ll (+8-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/trunced-buildvector-scalar-extended.ll (+5-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vec3-base.ll (+12-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll (+4-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/bool-logical-op-reduction-with-poison.ll (+16-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/logical-ops-poisonous-repeated.ll (+9-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/partial-register-extract.ll (+22-11)
  • (modified) llvm/test/Transforms/SLPVectorizer/reduced-gathered-vectorized.ll (+8-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/sincos.ll (+32-20)
  • (modified) llvm/test/Transforms/SLPVectorizer/struct-return-revec.ll (+16-12)
diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
index 23a79df7b2cee..8f512f0fc3ee8 100644
--- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
@@ -57,9 +57,9 @@ class BoUpSLP;
 
 struct SLPVectorizerPass : public OptionalPassInfoMixin<SLPVectorizerPass> {
   using StoreList = SmallVector<StoreInst *, 8>;
-  using StoreListMap = SmallMapVector<Value *, StoreList, 8>;
+  using StoreListMap = MapVector<Value *, StoreList>;
   using GEPList = SmallVector<GetElementPtrInst *, 8>;
-  using GEPListMap = SmallMapVector<Value *, GEPList, 8>;
+  using GEPListMap = MapVector<Value *, GEPList>;
   using InstSetVector = SmallSetVector<Instruction *, 8>;
 
   ScalarEvolution *SE = nullptr;
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 898115005a7dd..3ec332b93caa9 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -28,7 +28,6 @@
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallString.h"
-#include "llvm/ADT/SmallVectorExtras.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/ADT/iterator_range.h"
@@ -72,7 +71,6 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
-#include "llvm/IR/VectorTypeUtils.h"
 #ifdef EXPENSIVE_CHECKS
 #include "llvm/IR/Verifier.h"
 #endif
@@ -231,7 +229,7 @@ static cl::opt<bool>
                 cl::desc("Display the SLP trees with Graphviz"));
 
 static cl::opt<bool> VectorizeNonPowerOf2(
-    "slp-vectorize-non-power-of-2", cl::init(true), cl::Hidden,
+    "slp-vectorize-non-power-of-2", cl::init(false), cl::Hidden,
     cl::desc("Try to vectorize with non-power-of-2 number of elements."));
 
 static cl::opt<bool> ForcePostProcessStoresOperands(
@@ -243,19 +241,11 @@ static cl::opt<bool> NonVectReductions(
     cl::desc(
         "Use  non-vectorizable instructions as potential reduction roots."));
 
-static constexpr unsigned SmallProfitableNonPowerOf2 = 5;
-static constexpr unsigned SmallestNonPowerOf2 = 3;
-
 /// True when \p slp-vectorize-non-power-of-2 is enabled and \p NumElts is a
-/// supported non-power-of-2 width. The width is supported if \p NumElts is not
-/// a power of two and either it is small (<= 5, e.g. 3 or 5 lanes), or
-/// \p NumElts - 1 is also not a power of two (e.g. 6, 7, 10..15 lanes), or
-/// the elements being vectorized are themselves vectors (REVEC).
-static bool isAllowedNonPowerOf2VF(unsigned NumElts, bool IsVectorElement) {
-  return VectorizeNonPowerOf2 && !has_single_bit(NumElts) &&
-         ((SLPReVec && IsVectorElement) ||
-          NumElts <= SmallProfitableNonPowerOf2 ||
-          !has_single_bit(NumElts - 1));
+/// supported non-power-of-2 width: \p NumElts + 1 must be a power of two
+/// (e.g. 3 or 7 lanes, i.e. almost a full power-of-2 register).
+static bool isAllowedNonPowerOf2VF(unsigned NumElts) {
+  return VectorizeNonPowerOf2 && has_single_bit(NumElts + 1);
 }
 
 /// Enables vectorization of copyable elements.
@@ -310,10 +300,10 @@ static const unsigned MaxPHINumOperands = 128;
 /// be inevitably scalarized.
 static bool isValidElementType(Type *Ty) {
   // TODO: Support ScalableVectorType.
-  if (SLPReVec && isVectorizedTy(Ty) && !getVectorizedTypeVF(Ty).isScalable())
-    Ty = toScalarizedTy(Ty);
-  return canVectorizeTy(Ty) && !Ty->isX86_FP80Ty() && !Ty->isPPC_FP128Ty() &&
-         !Ty->isVoidTy();
+  if (SLPReVec && isa<FixedVectorType>(Ty))
+    Ty = Ty->getScalarType();
+  return VectorType::isValidElementType(Ty) && !Ty->isX86_FP80Ty() &&
+         !Ty->isPPC_FP128Ty();
 }
 
 /// Returns the "element type" of the given value/instruction \p V.
@@ -338,33 +328,15 @@ static Type *getValueType(Value *V, bool LookThroughCmp = false) {
 static unsigned getNumElements(Type *Ty) {
   assert(!isa<ScalableVectorType>(Ty) &&
          "ScalableVectorType is not supported.");
-  if (isVectorizedTy(Ty))
-    return getVectorizedTypeVF(Ty).getFixedValue();
+  if (auto *VecTy = dyn_cast<FixedVectorType>(Ty))
+    return VecTy->getNumElements();
   return 1;
 }
 
 /// \returns the vector type of ScalarTy based on vectorization factor.
-static Type *getWidenedType(Type *ScalarTy, unsigned VF) {
-  if (VF == 1 && !isVectorizedTy(ScalarTy)) {
-    // Workaround for 1 x vector types: toVectorizedTy returns the type
-    // unchanged when EC is scalar, but BoUpSLP relies on widening to
-    // <1 x ScalarTy> (or struct of <1 x ElTy>) to keep the rest of the
-    // pipeline operating on vector types.
-    if (auto *StructTy = dyn_cast<StructType>(ScalarTy)) {
-      assert(isUnpackedStructLiteral(StructTy) &&
-             "expected unpacked struct literal");
-      assert(all_of(StructTy->elements(), VectorType::isValidElementType) &&
-             "expected all element types to be valid vector element types");
-      return StructType::get(
-          StructTy->getContext(),
-          map_to_vector(StructTy->elements(), [&](Type *ElTy) -> Type * {
-            return FixedVectorType::get(ElTy, 1);
-          }));
-    }
-    return FixedVectorType::get(ScalarTy, 1);
-  }
-  return toVectorizedTy(toScalarizedTy(ScalarTy),
-                        ElementCount::getFixed(VF * getNumElements(ScalarTy)));
+static FixedVectorType *getWidenedType(Type *ScalarTy, unsigned VF) {
+  return FixedVectorType::get(ScalarTy->getScalarType(),
+                              VF * getNumElements(ScalarTy));
 }
 
 /// Returns the number of elements of the given type \p Ty, not less than \p Sz,
@@ -372,7 +344,7 @@ static Type *getWidenedType(Type *ScalarTy, unsigned VF) {
 /// legalization.
 static unsigned getFullVectorNumberOfElements(const TargetTransformInfo &TTI,
                                               Type *Ty, unsigned Sz) {
-  if (!isValidElementType(Ty) || isa<StructType>(Ty))
+  if (!isValidElementType(Ty))
     return bit_ceil(Sz);
   // Find the number of elements, which forms full vectors.
   const unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
@@ -387,7 +359,7 @@ static unsigned getFullVectorNumberOfElements(const TargetTransformInfo &TTI,
 static unsigned
 getFloorFullVectorNumberOfElements(const TargetTransformInfo &TTI, Type *Ty,
                                    unsigned Sz) {
-  if (!isValidElementType(Ty) || isa<StructType>(Ty))
+  if (!isValidElementType(Ty))
     return bit_floor(Sz);
   // Find the number of elements, which forms full vectors.
   unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
@@ -2067,8 +2039,6 @@ static bool hasFullVectorsOrPowerOf2(const TargetTransformInfo &TTI, Type *Ty,
     return false;
   if (has_single_bit(Sz))
     return true;
-  if (isa<StructType>(Ty))
-    return false;
   const unsigned NumParts = TTI.getNumberOfParts(getWidenedType(Ty, Sz));
   return NumParts > 0 && NumParts < Sz && has_single_bit(Sz / NumParts) &&
          Sz % NumParts == 0;
@@ -2078,20 +2048,19 @@ static bool hasFullVectorsOrPowerOf2(const TargetTransformInfo &TTI, Type *Ty,
 /// phase. If the type is going to be scalarized or does not uses whole
 /// registers, returns 1.
 static unsigned
-getNumberOfParts(const TargetTransformInfo &TTI, Type *VecTy, Type *ScalarTy,
+getNumberOfParts(const TargetTransformInfo &TTI, VectorType *VecTy,
+                 Type *ScalarTy,
                  const unsigned Limit = std::numeric_limits<unsigned>::max()) {
-  if (isa<StructType>(VecTy))
-    return 1;
   unsigned NumParts = TTI.getNumberOfParts(VecTy);
   if (NumParts == 0 || NumParts >= Limit)
     return 1;
   unsigned Sz = getNumElements(VecTy);
   unsigned ScalarSz = getNumElements(ScalarTy);
-  Type *ElementTy = toScalarizedTy(VecTy);
-  unsigned PWSz = getFullVectorNumberOfElements(TTI, ElementTy, Sz);
+  unsigned PWSz =
+      getFullVectorNumberOfElements(TTI, VecTy->getElementType(), Sz);
   if (NumParts >= Sz || PWSz % NumParts != 0 ||
       (PWSz / NumParts) % ScalarSz != 0 ||
-      !hasFullVectorsOrPowerOf2(TTI, ElementTy, PWSz / NumParts))
+      !hasFullVectorsOrPowerOf2(TTI, VecTy->getElementType(), PWSz / NumParts))
     return 1;
   const unsigned NumElts = PWSz / NumParts;
   if (divideCeil(Sz, NumElts) != NumParts)
@@ -2240,14 +2209,14 @@ class slpvectorizer::BoUpSLP {
         ReductionBitWidth >=
             DL->getTypeSizeInBits(
                 VectorizableTree.front()->Scalars.front()->getType()))
-      return cast<FixedVectorType>(
-          getWidenedType(VectorizableTree.front()->Scalars.front()->getType(),
-                         VectorizableTree.front()->getVectorFactor()));
-    return cast<FixedVectorType>(getWidenedType(
+      return getWidenedType(
+          VectorizableTree.front()->Scalars.front()->getType(),
+          VectorizableTree.front()->getVectorFactor());
+    return getWidenedType(
         IntegerType::get(
             VectorizableTree.front()->Scalars.front()->getContext(),
             ReductionBitWidth),
-        VectorizableTree.front()->getVectorFactor()));
+        VectorizableTree.front()->getVectorFactor());
   }
 
   /// Returns true if the tree results in one of the reduced bitcasts variants.
@@ -4020,7 +3989,8 @@ class slpvectorizer::BoUpSLP {
   /// scalar/slot type used to widen into \p VecTy/\p FinalVecTy and may itself
   /// be a FixedVectorType in ReVec mode or an adjusted type due to MinBWs.
   InstructionCost getVectorSpillReloadCost(const TreeEntry *E, Type *ScalarTy,
-                                           Type *VecTy, Type *FinalVecTy,
+                                           VectorType *VecTy,
+                                           VectorType *FinalVecTy,
                                            TTI::TargetCostKind CostKind) const;
 
   /// This is the recursive part of buildTree.
@@ -7137,12 +7107,12 @@ static InstructionCost getExtractWithExtendCost(
     const TargetTransformInfo &TTI, unsigned Opcode, Type *Dst,
     VectorType *VecTy, unsigned Index,
     TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) {
-  if (isVectorizedTy(Dst)) {
+  if (auto *ScalarTy = dyn_cast<FixedVectorType>(Dst)) {
     assert(SLPReVec && "Only supported by REVEC.");
-    auto *SubTp = cast<FixedVectorType>(
-        getWidenedType(toScalarizedTy(VecTy), getNumElements(Dst)));
+    auto *SubTp =
+        getWidenedType(VecTy->getElementType(), ScalarTy->getNumElements());
     return getShuffleCost(TTI, TTI::SK_ExtractSubvector, VecTy, {}, CostKind,
-                          Index * getNumElements(Dst), SubTp) +
+                          Index * ScalarTy->getNumElements(), SubTp) +
            TTI.getCastInstrCost(Opcode, Dst, SubTp, TTI::CastContextHint::None,
                                 CostKind);
   }
@@ -7235,7 +7205,7 @@ static bool isMaskedLoadCompress(
   InterleaveFactor = 0;
   Type *ScalarTy = VL.front()->getType();
   const size_t Sz = VL.size();
-  auto *VecTy = cast<VectorType>(getWidenedType(ScalarTy, Sz));
+  auto *VecTy = getWidenedType(ScalarTy, Sz);
   constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
   SmallVector<int> Mask;
   if (!Order.empty())
@@ -7271,7 +7241,7 @@ static bool isMaskedLoadCompress(
   // Check for very large distances between elements.
   if (*Diff / Sz >= MaxRegSize / 8)
     return false;
-  LoadVecTy = cast<FixedVectorType>(getWidenedType(ScalarTy, *Diff + 1));
+  LoadVecTy = getWidenedType(ScalarTy, *Diff + 1);
   auto *LI = cast<LoadInst>(Order.empty() ? VL.front() : VL[Order.front()]);
   Align CommonAlignment = LI->getAlign();
   IsMasked = !isSafeToLoadUnconditionally(
@@ -7320,8 +7290,8 @@ static bool isMaskedLoadCompress(
   }
   if (IsStrided && !IsMasked && Order.empty()) {
     // Check for potential segmented(interleaved) loads.
-    VectorType *AlignedLoadVecTy = cast<VectorType>(getWidenedType(
-        ScalarTy, getFullVectorNumberOfElements(TTI, ScalarTy, *Diff + 1)));
+    VectorType *AlignedLoadVecTy = getWidenedType(
+        ScalarTy, getFullVectorNumberOfElements(TTI, ScalarTy, *Diff + 1));
     if (!isSafeToLoadUnconditionally(Ptr0, AlignedLoadVecTy, CommonAlignment,
                                      DL, cast<LoadInst>(VL.back()), &AC, &DT,
                                      &TLI))
@@ -7512,7 +7482,7 @@ bool BoUpSLP::analyzeConstantStrideCandidate(
 
   Type *StrideTy = DL->getIndexType(Ptr0->getType());
   SPtrInfo.StrideVal = ConstantInt::getSigned(StrideTy, StrideIntVal);
-  SPtrInfo.Ty = cast<FixedVectorType>(getWidenedType(NewScalarTy, VecSz));
+  SPtrInfo.Ty = getWidenedType(NewScalarTy, VecSz);
   return true;
 }
 
@@ -7567,8 +7537,7 @@ bool BoUpSLP::analyzeRtStrideCandidate(ArrayRef<Value *> PointerOps,
     NewScalarTy = Type::getIntNTy(
         SE->getContext(),
         DL->getTypeSizeInBits(BaseTy).getFixedValue() * NumOffsets);
-  auto *StridedLoadTy =
-      cast<FixedVectorType>(getWidenedType(NewScalarTy, VecSz));
+  FixedVectorType *StridedLoadTy = getWidenedType(NewScalarTy, VecSz);
   unsigned MinProfitableStridedOps =
       IsLoad ? MinProfitableStridedLoads : MinProfitableStridedStores;
   const unsigned BaseTyNumElts = getNumElements(BaseTy);
@@ -7767,7 +7736,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
   // Check the order of pointer operands or that all pointers are the same.
   bool IsSorted = sortPtrAccesses(PointerOps, ScalarTy, *DL, *SE, Order);
 
-  auto *VecTy = cast<VectorType>(getWidenedType(ScalarTy, Sz));
+  auto *VecTy = getWidenedType(ScalarTy, Sz);
   Align CommonAlignment = computeCommonAlignment<LoadInst>(VL);
   // Cache masked gather legality - both the !IsSorted path below and the
   // post-branch check use the same VecTy/CommonAlignment, and the underlying
@@ -7848,7 +7817,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
     // estimate as a buildvector, otherwise estimate as splat.
     APInt DemandedElts = APInt::getAllOnes(Sz);
     Type *PtrScalarTy = PointerOps.front()->getType()->getScalarType();
-    auto *PtrVecTy = cast<VectorType>(getWidenedType(PtrScalarTy, Sz));
+    VectorType *PtrVecTy = getWidenedType(PtrScalarTy, Sz);
     // Cache the underlying object of PointerOps.front() - it is invariant
     // across the per-V comparisons below and getUnderlyingObject walks
     // GEP/cast chains.
@@ -7945,7 +7914,7 @@ BoUpSLP::LoadsState BoUpSLP::canVectorizeLoads(
       }
       for (const auto &[SliceStart, LS] : States) {
         const unsigned SliceVF = std::min<unsigned>(VF, VL.size() - SliceStart);
-        auto *SubVecTy = cast<VectorType>(getWidenedType(ScalarTy, SliceVF));
+        auto *SubVecTy = getWidenedType(ScalarTy, SliceVF);
         auto *LI0 = cast<LoadInst>(VL[SliceStart]);
         InstructionCost VectorGEPCost =
             (LS == LoadsState::ScatterVectorize && ProfitableGatherPointers)
@@ -8550,8 +8519,7 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom,
       const auto *It = find_if_not(TE.Scalars, isConstant);
       if (It == TE.Scalars.begin())
         return OrdersType();
-      auto *Ty =
-          cast<VectorType>(getWidenedType(TE.Scalars.front()->getType(), Sz));
+      auto *Ty = getWidenedType(TE.Scalars.front()->getType(), Sz);
       if (It != TE.Scalars.end()) {
         OrdersType Order(Sz, Sz);
         unsigned Idx = std::distance(TE.Scalars.begin(), It);
@@ -8672,13 +8640,6 @@ bool BoUpSLP::isProfitableToReorder() const {
   constexpr unsigned TinyTree = 10;
   constexpr unsigned PhiOpsLimit = 12;
   constexpr unsigned GatherLoadsLimit = 2;
-  // Do not reorder splat stores.
-  if (VectorizableTree.size() == 2 &&
-      VectorizableTree.front()->State == TreeEntry::Vectorize &&
-      VectorizableTree.front()->getOpcode() == Instruction::Store &&
-      VectorizableTree.back()->Scalars.front() ==
-          VectorizableTree.back()->Scalars.back())
-    return false;
   if (VectorizableTree.size() <= TinyTree)
     return true;
   if (VectorizableTree.front()->hasState() &&
@@ -8816,12 +8777,6 @@ void BoUpSLP::reorderTopToBottom() {
   // Maps a TreeEntry to the reorder indices of external users.
   DenseMap<const TreeEntry *, SmallVector<OrdersType, 1>>
       ExternalUserReorderMap;
-  // TODO: Reordering of struct types is not supported.
-  if (any_of(VectorizableTree, [](const std::unique_ptr<TreeEntry> &TE) {
-        return TE->State == TreeEntry::Vectorize &&
-               isa<StructType>(getValueType(TE->Scalars.front()));
-      }))
-    return;
   // Compute IgnoreReorder once - it depends only on UserIgnoreList and
   // VectorizableTree.front(), which do not change during this loop.
   const bool IgnoreReorder =
@@ -8848,8 +8803,7 @@ void BoUpSLP::reorderTopToBottom() {
     if (TE->hasState() && TE->isAltShuffle() &&
         TE->State != TreeEntry::SplitVectorize) {
       Type *ScalarTy = TE->Scalars[0]->getType();
-      auto *VecTy =
-          cast<VectorType>(getWidenedType(ScalarTy, TE->Scalars.size()));
+      VectorType *VecTy = getWidenedType(ScalarTy, TE->Scalars.size());
       unsigned Opcode0 = TE->getOpcode();
       unsigned Opcode1 = TE->getAltOpcode();
       SmallBitVector OpcodeMask(
@@ -9218,10 +9172,6 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
     }
     if (Users.first) {
       auto &Data = Users;
-      // TODO: Reordering of struct types is not supported.
-      if (Data.first->State == TreeEntry::Vectorize &&
-          isa<StructType>(getValueType(Data.first->Scalars.front())))
-        continue;
       if (Data.first->State == TreeEntry::SplitVectorize) {
         assert(
             Data.second.size() <= 2 &&
@@ -10022,8 +9972,7 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     ArrayRef<Value *> Values(reinterpret_cast<Value *const *>(Loads.begin()),
                              Loads.size());
     Align Alignment = computeCommonAlignment<LoadInst>(Values);
-    auto *Ty = cast<VectorType>(
-        getWidenedType(Loads.front()->getType(), Loads.size()));
+    auto *Ty = getWidenedType(Loads.front()->getType(), Loads.size());
     return TTI->isLegalMaskedGather(Ty, Alignment) &&
            !TTI->forceScalarizeMaskedGather(Ty, Alignment);
   };
@@ -10035,13 +9984,8 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
     SmallVector<std::pair<ArrayRef<Value *>, LoadsState>> Results;
     unsigned StartIdx = 0;
     SmallVector<int> CandidateVFs;
-    if (isAllowedNonPowerOf2VF(
-            MaxVF, isa<FixedVectorType>(Loads.front()->getType()))) {
-      const unsigned FullVectorNumElements = getFullVectorNumberOfElements(
-          *TTI, Loads.front()->getType(), MaxVF - 1);
-      if (MaxVF >= SmallestNonPowerOf2 && FullVectorNumElements != MaxVF - 1)
-        CandidateVFs.push_back(MaxVF);
-    }
+    if (isAllowedNonPowerOf2VF(MaxVF))
+      CandidateVFs.push_back(MaxVF);
     for (int NumElts = getFloorFullVectorNumberOfElements(
              *TTI, Loads.front()->getType(), MaxVF);
          NumElts > 1; NumElts = getFloorFullVectorNumberOfElements(
@@ -10326,8 +10270,7 @@ void BoUpSLP::tryToVectorizeGatheredLoads(
                 // Segmented load detected - vectorize at maximum vector factor.
                 if (InterleaveFactor <= Slice.size() &&
                     TTI.isLegalInterleavedAccessType(
-                        cast<VectorType>(
-                            getWidenedType(Slice.front()->getType(), VF)),
+                        getWidenedType(Slice.front()->getType(), VF),
                         InterleaveFactor,
                         cast<LoadInst>(Slice.front())->getAlign(),
                         cast<LoadInst>(Slice.front())
@@ -10587,10 +10530,11 @@ buildIntrinsicArgTypes(const CallInst *CI, const Intrinsic::ID ID,
 /// function (if possible) calls. Returns invalid cost for the corresponding
 /// calls, if they cannot be vectorized/will be scalarized.
 static std::pair<InstructionCost, InstructionCost>
-getVectorCallCosts(CallInst *CI, Type *VecTy, TargetTransformInfo *TTI,
-                   TargetLibraryInfo *TLI, ArrayRef<Type *> ArgTys) {
+getVectorCallCosts(CallInst *CI, FixedVectorType *VecTy,
+                   TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
+                   ArrayRef<Type *> ArgTys) {
   auto Shape = VFShape::get(CI->getFunctionType(),
-                            ElementCount::getFixed(getNumElements(VecTy)),
+                            ElementCount::getFixed(Ve...
[truncated]
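The `isAllowedNonPowerOf2VF` hunk above restores a narrower rule for non-power-of-2 vector widths. As a reading aid, here is a standalone C++ sketch that transcribes both sides of that hunk so the accepted lane counts can be compared directly. It assumes `hasSingleBit` behaves like `llvm::has_single_bit`, and it models the `slp-vectorize-non-power-of-2` and `-slp-revec` options as plain `bool` parameters. All function names here are hypothetical and are not part of SLPVectorizer.

```cpp
#include <cassert>

// Stand-in for llvm::has_single_bit: true iff N is a nonzero power of two.
constexpr bool hasSingleBit(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }

// "-" side of the hunk (the behavior this PR reverts). Flag models
// -slp-vectorize-non-power-of-2 (cl::init(true) before the revert);
// ReVec models -slp-revec.
constexpr bool allowedBeforeRevert(bool Flag, unsigned NumElts,
                                   bool IsVectorElement, bool ReVec) {
  constexpr unsigned SmallProfitableNonPowerOf2 = 5;
  return Flag && !hasSingleBit(NumElts) &&
         ((ReVec && IsVectorElement) ||
          NumElts <= SmallProfitableNonPowerOf2 ||
          !hasSingleBit(NumElts - 1));
}

// "+" side of the hunk (the behavior restored by the revert):
// NumElts + 1 must be a power of two. Flag is cl::init(false) after the revert.
constexpr bool allowedAfterRevert(bool Flag, unsigned NumElts) {
  return Flag && hasSingleBit(NumElts + 1);
}

// Checks the lane counts quoted in the two doc comments, with the flag on.
void checkDocCommentExamples() {
  // Before: small widths (3, 5) and widths where NumElts - 1 is not a power
  // of two (6, 7, 10..15).
  const unsigned AllowedBefore[] = {3, 5, 6, 7, 10, 15};
  for (unsigned N : AllowedBefore)
    assert(allowedBeforeRevert(true, N, false, false));
  // Before: 9 is rejected because 9 - 1 = 8 is a power of two.
  assert(!allowedBeforeRevert(true, 9, false, false));
  // After: e.g. 3 or 7 lanes are accepted, 5 and 6 are not.
  assert(allowedAfterRevert(true, 3) && allowedAfterRevert(true, 7));
  assert(!allowedAfterRevert(true, 5) && !allowedAfterRevert(true, 6));
}
```

With the flag at its restored default of `false` (`cl::init(false)` in the hunk), `allowedAfterRevert` rejects every width, so after this revert non-power-of-2 vectorization stays off unless the option is passed explicitly.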

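The `getWidenedType` hunk is where struct-returning calls were handled. Before the revert, a literal struct such as the `{ float, i32 }` returned by `llvm.frexp` was widened member by member into a struct of vectors; the restored version only builds a `FixedVectorType`. The sketch below models that difference with a hypothetical `ToyType`, which is not `llvm::Type`. It ignores the `VF == 1` special case and assumes struct members are scalars.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy type model for illustration; this is not llvm::Type.
struct ToyType {
  enum KindTy { Scalar, Vector, Struct } Kind;
  std::string ScalarName;       // Scalar/Vector: element type name, e.g. "float"
  unsigned NumElts = 1;         // Vector: number of lanes
  std::vector<ToyType> Members; // Struct: member types (assumed scalar)

  static ToyType makeScalar(std::string Name) {
    return {Scalar, std::move(Name), 1, {}};
  }
  static ToyType makeVector(std::string Name, unsigned N) {
    return {Vector, std::move(Name), N, {}};
  }
  static ToyType makeStruct(std::vector<ToyType> Ms) {
    return {Struct, "", 1, std::move(Ms)};
  }
};

// Prints types in LLVM IR syntax, e.g. "{ <4 x float>, <4 x i32> }".
std::string toString(const ToyType &T) {
  if (T.Kind == ToyType::Scalar)
    return T.ScalarName;
  if (T.Kind == ToyType::Vector)
    return "<" + std::to_string(T.NumElts) + " x " + T.ScalarName + ">";
  std::string S = "{ ";
  bool First = true;
  for (const ToyType &M : T.Members) {
    if (!First)
      S += ", ";
    S += toString(M);
    First = false;
  }
  return S + " }";
}

// Mirrors getNumElements: lane count for vectors, 1 otherwise.
unsigned numElements(const ToyType &T) {
  return T.Kind == ToyType::Vector ? T.NumElts : 1;
}

// Models the "-" side: a literal struct of scalars becomes a struct of
// <VF x member> vectors; anything else becomes <VF * lanes x element>.
ToyType widenBeforeRevert(const ToyType &T, unsigned VF) {
  if (T.Kind == ToyType::Struct) {
    std::vector<ToyType> Widened;
    for (const ToyType &M : T.Members)
      Widened.push_back(ToyType::makeVector(M.ScalarName, VF));
    return ToyType::makeStruct(std::move(Widened));
  }
  return ToyType::makeVector(T.ScalarName, VF * numElements(T));
}

// Models the "+" side: FixedVectorType::get(getScalarType(), VF * lanes).
// Structs get no widened type (std::nullopt).
std::optional<ToyType> widenAfterRevert(const ToyType &T, unsigned VF) {
  if (T.Kind == ToyType::Struct)
    return std::nullopt;
  return ToyType::makeVector(T.ScalarName, VF * numElements(T));
}
```

In the restored code, `isValidElementType` goes through `VectorType::isValidElementType`, which rejects struct types, so such bundles are not formed in the first place; the `std::nullopt` branch stands in for that.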
@github-actions

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef([^a-zA-Z0-9_-]|$)|UndefValue::get)' 'HEAD~1' HEAD llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp llvm/test/CodeGen/WebAssembly/slp-memory-interleave.ll llvm/test/Transforms/PhaseOrdering/AArch64/reduce_submuladd.ll llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll llvm/test/Transforms/SLPVectorizer/AArch64/scalable-vector.ll llvm/test/Transforms/SLPVectorizer/AArch64/trunc-insertion.ll llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s352.ll llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll llvm/test/Transforms/SLPVectorizer/RISCV/horizontal-list.ll llvm/test/Transforms/SLPVectorizer/RISCV/reduced-value-repeated-and-vectorized.ll llvm/test/Transforms/SLPVectorizer/RISCV/revec.ll llvm/test/Transforms/SLPVectorizer/SystemZ/minbitwidth-trunc.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll llvm/test/Transforms/SLPVectorizer/X86/arith-add-saddo.ll llvm/test/Transforms/SLPVectorizer/X86/arith-add-uaddo.ll llvm/test/Transforms/SLPVectorizer/X86/arith-fp-inseltpoison.ll llvm/test/Transforms/SLPVectorizer/X86/arith-fp.ll llvm/test/Transforms/SLPVectorizer/X86/arith-mul-smulo.ll llvm/test/Transforms/SLPVectorizer/X86/arith-mul-umulo.ll llvm/test/Transforms/SLPVectorizer/X86/arith-sub-ssubo.ll llvm/test/Transforms/SLPVectorizer/X86/arith-sub-usubo.ll llvm/test/Transforms/SLPVectorizer/X86/bool-mask.ll llvm/test/Transforms/SLPVectorizer/X86/buildvector-store-chains.ll llvm/test/Transforms/SLPVectorizer/X86/compare-reduce.ll llvm/test/Transforms/SLPVectorizer/X86/deleted-instructions-clear.ll llvm/test/Transforms/SLPVectorizer/X86/dot-product.ll llvm/test/Transforms/SLPVectorizer/X86/entries-shuffled-diff-sizes.ll llvm/test/Transforms/SLPVectorizer/X86/extracts-non-extendable.ll 
llvm/test/Transforms/SLPVectorizer/X86/gather-extractelements-different-bbs.ll llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll llvm/test/Transforms/SLPVectorizer/X86/multi-use-bitcasted-reduction.ll llvm/test/Transforms/SLPVectorizer/X86/multi_user.ll llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll llvm/test/Transforms/SLPVectorizer/X86/parent-node-split-non-schedulable.ll llvm/test/Transforms/SLPVectorizer/X86/reduced-val-vectorized-in-transform.ll llvm/test/Transforms/SLPVectorizer/X86/reduction-bool-logic-op-inside.ll llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll llvm/test/Transforms/SLPVectorizer/X86/reduction-same-vals.ll llvm/test/Transforms/SLPVectorizer/X86/reused-extract-scalar-lanes.ll llvm/test/Transforms/SLPVectorizer/X86/revec-non-power-2-to-power-2-large-vect.ll llvm/test/Transforms/SLPVectorizer/X86/rgb_phi.ll llvm/test/Transforms/SLPVectorizer/X86/scalarize-ctlz.ll llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll llvm/test/Transforms/SLPVectorizer/X86/select-copyable-cmp-poison.ll llvm/test/Transforms/SLPVectorizer/X86/select-reduction-op.ll llvm/test/Transforms/SLPVectorizer/X86/trunced-buildvector-scalar-extended.ll llvm/test/Transforms/SLPVectorizer/X86/vec3-base.ll llvm/test/Transforms/SLPVectorizer/X86/vectorize-widest-phis.ll llvm/test/Transforms/SLPVectorizer/bool-logical-op-reduction-with-poison.ll llvm/test/Transforms/SLPVectorizer/logical-ops-poisonous-repeated.ll llvm/test/Transforms/SLPVectorizer/partial-register-extract.ll llvm/test/Transforms/SLPVectorizer/reduced-gathered-vectorized.ll llvm/test/Transforms/SLPVectorizer/sincos.ll llvm/test/Transforms/SLPVectorizer/struct-return-revec.ll

The following files introduce new uses of undef:

  • llvm/test/Transforms/SLPVectorizer/X86/arith-fp.ll
  • llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll
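The check behind this report can be approximated as follows: count matches of the pickaxe regex from the command above in a file before and after the change, and flag the file if the count grows. This is a simplified model, not the actual CI implementation. The function names are hypothetical, and the sketch assumes that `std::regex` (ECMAScript grammar) interprets the pattern the same way git does. It matches line by line so that `$` means "end of line".

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>

// The pattern from the `git diff --pickaxe-regex -S '...'` command above.
// The leading character class excludes '#', so a preprocessor "#undef" is not
// counted. Assumption: std::regex treats the pattern as git does.
const std::regex UndefUse(
    R"(([^a-zA-Z0-9#_-]undef([^a-zA-Z0-9_-]|$)|UndefValue::get))");

// Counts matches line by line, so that `$` means "end of line".
long countUndefUses(const std::string &Text) {
  std::istringstream In(Text);
  long Count = 0;
  for (std::string Line; std::getline(In, Line);)
    Count += std::distance(
        std::sregex_iterator(Line.begin(), Line.end(), UndefUse),
        std::sregex_iterator());
  return Count;
}

// Simplified model of the report: `git -S` selects files whose match count
// changed; a file "introduces new uses" when the count goes up.
bool introducesUndef(const std::string &Before, const std::string &After) {
  return countUndefUses(After) > countUndefUses(Before);
}
```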

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

@mikaelholmen

Contributor

Thanks @zmodem !

@Andarwinux

Contributor

6e8b6ef [SLP][REVEC] Fix crash on scalable vector types with -slp-revec

It fixed #198076; reverting it will reintroduce the crash.

@zmodem

zmodem commented May 18, 2026

Contributor Author

It fixed #198076; reverting it will reintroduce the crash.

Thanks for the heads up! I verified that the crash does not reproduce after this PR, so presumably it was introduced by one of the commits being reverted here.

@zmodem zmodem disabled auto-merge May 18, 2026 10:20
@zmodem zmodem merged commit c7c289e into llvm:main May 18, 2026
13 of 17 checks passed
@ms178

ms178 commented May 18, 2026

This seems to have fixed arithmetic issues I've seen when compiling pixman-git. I tested with a revision from earlier today, before this PR was merged, and with one after it was merged.

The previous last known-good revision was d90a802.

Maybe this helps pinpoint the root cause.

@alexey-bataev

Member

This seems to have fixed arithmetic issues I've seen when compiling pixman-git. I tested with a revision from earlier today, before this PR was merged, and with one after it was merged.

The previous last known-good revision was d90a802.

Maybe this helps pinpoint the root cause.

Provide a reproducer; this does not help at all!

@ms178

ms178 commented May 18, 2026

Provide a reproducer; this does not help at all!

Sorry, the best I can do is show the output of two failing tests, which gives a few more clues. The failures are reproducible with the CachyOS package recipe from my repo:

18/35 pixman:alphamap                       FAIL             0.00s   exit status 1
>>> MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 LD_LIBRARY_PATH=/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman:/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman MALLOC_PERTURB_=249 /tmp/makepkg/pixman-git/src/_build_std_pgo/test/alphamap
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― 

Wrong alpha value at (47, 10). Should be 0.8; got 1. Source was 0.533333, original dest was 0.266667
src: a8r8g8b8, alpha: null, origin 0 0
dst: a8r8g8b8, alpha: a4r4g4b4, origin: 10 10

20/35 pixman:pixel-test                     FAIL             0.28s   exit status 1
>>> MALLOC_PERTURB_=201 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 LD_LIBRARY_PATH=/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman:/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman /tmp/makepkg/pixman-git/src/_build_std_pgo/test/pixel-test
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― 
Listing only the last 100 lines from a long log.
   src format:       a8r8g8b8
   dest format:      a4r4g4b4
 - source ARGB:      0.694118  1.000000  0.000000  0.423529   (pixel: 0xb1ff006c)
                          177       255         0       108
 - dest ARGB:        1.000000  1.000000  0.466667  0.800000   (pixel: 0xff7c)
                           15        15         7        12
 - expected ARGB:    1.000000  1.000000  0.466667  0.800000
   min acceptable:         15        15         7        12
   got:                    15        15        15        15   (pixel: 0xff7cffff)
   max acceptable:         15        15         7        13
----------- Test 521 failed ----------
   operator:         PIXMAN_OP_EXCLUSION (no mask)
   src format:       r5g6b5
   dest format:      a4r4g4b4
 - source ARGB:      1.000000  1.000000  1.000000  1.000000   (pixel: 0xffff)
                            0        31        63        31
 - dest ARGB:        0.066667  0.600000  0.400000  0.533333   (pixel: 0x1968)
                            1         9         6         8
 - expected ARGB:    1.000000  0.400000  0.600000  0.466667
   min acceptable:         15         6         9         7
   got:                    15         0         0         0   (pixel: 0x1968f000)
   max acceptable:         15         6         9         7
----------- Test 523 failed ----------
   operator:         PIXMAN_OP_SOFT_LIGHT (no mask)
   src format:       a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      0.666667  0.333333  0.000000  0.000000   (pixel: 0xa500)
                           10         5         0         0
 - dest ARGB:        0.105882  1.000000  0.000000  0.615686   (pixel: 0x1bff009d)
                           27       255         0       157
 - expected ARGB:    0.701961  1.000000  0.000000  1.000000
   min acceptable:        176       252        -2       252
   got:                   179       254       228       255   (pixel: 0xb3fee4ff)
   max acceptable:        182       258         3       258
----------- Test 526 failed ----------
   operator:         PIXMAN_OP_EXCLUSION (no mask)
   src format:       a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      0.266667  0.133333  0.266667  0.466667   (pixel: 0x4247)
                            4         2         4         7
 - dest ARGB:        0.000000  0.847059  1.000000  1.000000   (pixel: 0x00d8ffff)
                            0       216       255       255
 - expected ARGB:    0.266667  0.754510  0.733333  0.533333
   min acceptable:         64       189       184       133
   got:                   118       134       137       136   (pixel: 0x76868988)
   max acceptable:         71       196       191       139
----------- Test 529 failed ----------
   operator:         PIXMAN_OP_EXCLUSION (no mask)
   src format:       a2r2g2b2
   dest format:      a4r4g4b4
 - source ARGB:      0.666667  0.333333  0.000000  1.000000   (pixel: 0x93)
                            2         1         0         3
 - dest ARGB:        0.000000  0.000000  1.000000  1.000000   (pixel: 0x00ff)
                            0         0        15        15
 - expected ARGB:    0.666667  0.333333  1.000000  0.000000
   min acceptable:         10         5        15         0
   got:                    15        10        15         0   (pixel: 0xfffaf0)
   max acceptable:         10         5        15         0
----------- Test 534 failed ----------
   operator:         PIXMAN_OP_HARD_LIGHT (no mask)
   src format:       a4r4g4b4
   dest format:      a4r4g4b4
 - source ARGB:      0.066667  0.133333  0.800000  0.466667   (pixel: 0x12c7)
                            1         2        12         7
 - dest ARGB:        0.733333  0.066667  0.000000  0.000000   (pixel: 0xb100)
                           11         1         0         0
 - expected ARGB:    0.751111  0.235556  1.000000  0.760000
   min acceptable:         11         3        15        11
   got:                    15        15        15        15   (pixel: 0xb100ffff)
   max acceptable:         12         3        15        12
----------- Test 541 failed ----------
   operator:         PIXMAN_OP_HARD_LIGHT (unified alpha)
   src format:       r5g6b5
   mask format:      a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      1.000000  0.741935  0.047619  0.225806   (pixel: 0xb867)
                            0        23         3         7
 - mask ARGB:        0.533333  0.133333  0.866667  0.600000   (pixel: 0x82d9)
                            8         2        13         9
 - dest ARGB:        0.000000  0.000000  1.000000  0.772549   (pixel: 0x0000ffc5)
                            0         0       255       197
 - expected ARGB:    0.533333  0.395699  0.542857  0.667029
   min acceptable:        133        98       135       167
   got:                   223       165        62       152   (pixel: 0xdfa53e98)
   max acceptable:        139       104       142       174
----------- Test 545 failed ----------
   operator:         PIXMAN_OP_HARD_LIGHT (unified alpha)
   src format:       r5g6b5
   mask format:      a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      1.000000  0.258065  0.238095  1.000000   (pixel: 0x41ff)
                            0         8        15        31
 - mask ARGB:        0.000000  0.800000  1.000000  1.000000   (pixel: 0x0cff)
                            0        12        15        15
 - dest ARGB:        0.000000  0.882353  1.000000  0.000000   (pixel: 0x00e1ff00)
                            0       225       255         0
 - expected ARGB:    0.000000  0.882353  1.000000  0.000000
   min acceptable:         -2       222       252        -2
   got:                   255       182       180       255   (pixel: 0xffb6b4ff)
   max acceptable:          3       229       258         3
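Each pixel in these logs is printed three ways: the packed value in parentheses, the raw integer of each channel (second row), and that integer normalized to [0, 1] (first row). Below is a minimal C++ sketch of that decoding. It assumes pixman's format names list channels from the most significant bits down, so `a2r2g2b2` packs A, R, G and B at 2 bits each. `PixelFormat` and `decode` are illustrative helpers, not pixman API.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative description of a packed pixel format: the bit width of each
// channel in {A, R, G, B} order, packed from the most significant bits down.
// A width of 0 marks a missing channel (r5g6b5 has no alpha).
struct PixelFormat {
    int bits[4];
};

constexpr PixelFormat kA2R2G2B2{{2, 2, 2, 2}};
constexpr PixelFormat kA4R4G4B4{{4, 4, 4, 4}};
constexpr PixelFormat kR5G6B5{{0, 5, 6, 5}};

// Splits `pixel` into the raw channel integers (second log row) and the
// values normalized to [0, 1] (first log row), both in A, R, G, B order.
void decode(const PixelFormat& f, std::uint32_t pixel, int raw[4], double norm[4]) {
    int shift = 0;
    // B occupies the lowest bits, so walk the channels B, G, R, A.
    for (int c = 3; c >= 0; --c) {
        const int w = f.bits[c];
        assert(w >= 0 && w < 32);
        if (w == 0) {
            raw[c] = 0;     // the log prints 0 for the missing channel...
            norm[c] = 1.0;  // ...and 1.0 for its normalized value (opaque)
            continue;
        }
        const std::uint32_t mask = (1u << w) - 1u;
        raw[c] = static_cast<int>((pixel >> shift) & mask);
        norm[c] = static_cast<double>(raw[c]) / mask;
        shift += w;
    }
    assert(shift <= 32);
}
```

For example, the Test 529 source pixel 0x93 decodes to raw A, R, G, B = 2, 1, 0, 3, which normalizes to 0.666667, 0.333333, 0.0 and 1.0, matching the printed values. The sketch reads only the low bits that belong to the format. It therefore ignores the extra high bits that appear in some `got` pixels, such as 0xfffaf0 for a4r4g4b4.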

@alexey-bataev

Member

Provide a reproducer; this does not help at all!

Sorry, the best I can do is show the output of two failing tests with more clues. Both are reproducible with the package recipe for CachyOS from my repo:

18/35 pixman:alphamap                       FAIL             0.00s   exit status 1
>>> MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 LD_LIBRARY_PATH=/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman:/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman MALLOC_PERTURB_=249 /tmp/makepkg/pixman-git/src/_build_std_pgo/test/alphamap
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― 

Wrong alpha value at (47, 10). Should be 0.8; got 1. Source was 0.533333, original dest was 0.266667
src: a8r8g8b8, alpha: null, origin 0 0
dst: a8r8g8b8, alpha: a4r4g4b4, origin: 10 10

20/35 pixman:pixel-test                     FAIL             0.28s   exit status 1
>>> MALLOC_PERTURB_=201 MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 LD_LIBRARY_PATH=/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman:/tmp/makepkg/pixman-git/src/_build_std_pgo/pixman /tmp/makepkg/pixman-git/src/_build_std_pgo/test/pixel-test
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― 
Listing only the last 100 lines from a long log.
   src format:       a8r8g8b8
   dest format:      a4r4g4b4
 - source ARGB:      0.694118  1.000000  0.000000  0.423529   (pixel: 0xb1ff006c)
                          177       255         0       108
 - dest ARGB:        1.000000  1.000000  0.466667  0.800000   (pixel: 0xff7c)
                           15        15         7        12
 - expected ARGB:    1.000000  1.000000  0.466667  0.800000
   min acceptable:         15        15         7        12
   got:                    15        15        15        15   (pixel: 0xff7cffff)
   max acceptable:         15        15         7        13
----------- Test 521 failed ----------
   operator:         PIXMAN_OP_EXCLUSION (no mask)
   src format:       r5g6b5
   dest format:      a4r4g4b4
 - source ARGB:      1.000000  1.000000  1.000000  1.000000   (pixel: 0xffff)
                            0        31        63        31
 - dest ARGB:        0.066667  0.600000  0.400000  0.533333   (pixel: 0x1968)
                            1         9         6         8
 - expected ARGB:    1.000000  0.400000  0.600000  0.466667
   min acceptable:         15         6         9         7
   got:                    15         0         0         0   (pixel: 0x1968f000)
   max acceptable:         15         6         9         7
----------- Test 523 failed ----------
   operator:         PIXMAN_OP_SOFT_LIGHT (no mask)
   src format:       a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      0.666667  0.333333  0.000000  0.000000   (pixel: 0xa500)
                           10         5         0         0
 - dest ARGB:        0.105882  1.000000  0.000000  0.615686   (pixel: 0x1bff009d)
                           27       255         0       157
 - expected ARGB:    0.701961  1.000000  0.000000  1.000000
   min acceptable:        176       252        -2       252
   got:                   179       254       228       255   (pixel: 0xb3fee4ff)
   max acceptable:        182       258         3       258
----------- Test 526 failed ----------
   operator:         PIXMAN_OP_EXCLUSION (no mask)
   src format:       a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      0.266667  0.133333  0.266667  0.466667   (pixel: 0x4247)
                            4         2         4         7
 - dest ARGB:        0.000000  0.847059  1.000000  1.000000   (pixel: 0x00d8ffff)
                            0       216       255       255
 - expected ARGB:    0.266667  0.754510  0.733333  0.533333
   min acceptable:         64       189       184       133
   got:                   118       134       137       136   (pixel: 0x76868988)
   max acceptable:         71       196       191       139
----------- Test 529 failed ----------
   operator:         PIXMAN_OP_EXCLUSION (no mask)
   src format:       a2r2g2b2
   dest format:      a4r4g4b4
 - source ARGB:      0.666667  0.333333  0.000000  1.000000   (pixel: 0x93)
                            2         1         0         3
 - dest ARGB:        0.000000  0.000000  1.000000  1.000000   (pixel: 0x00ff)
                            0         0        15        15
 - expected ARGB:    0.666667  0.333333  1.000000  0.000000
   min acceptable:         10         5        15         0
   got:                    15        10        15         0   (pixel: 0xfffaf0)
   max acceptable:         10         5        15         0
----------- Test 534 failed ----------
   operator:         PIXMAN_OP_HARD_LIGHT (no mask)
   src format:       a4r4g4b4
   dest format:      a4r4g4b4
 - source ARGB:      0.066667  0.133333  0.800000  0.466667   (pixel: 0x12c7)
                            1         2        12         7
 - dest ARGB:        0.733333  0.066667  0.000000  0.000000   (pixel: 0xb100)
                           11         1         0         0
 - expected ARGB:    0.751111  0.235556  1.000000  0.760000
   min acceptable:         11         3        15        11
   got:                    15        15        15        15   (pixel: 0xb100ffff)
   max acceptable:         12         3        15        12
----------- Test 541 failed ----------
   operator:         PIXMAN_OP_HARD_LIGHT (unified alpha)
   src format:       r5g6b5
   mask format:      a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      1.000000  0.741935  0.047619  0.225806   (pixel: 0xb867)
                            0        23         3         7
 - mask ARGB:        0.533333  0.133333  0.866667  0.600000   (pixel: 0x82d9)
                            8         2        13         9
 - dest ARGB:        0.000000  0.000000  1.000000  0.772549   (pixel: 0x0000ffc5)
                            0         0       255       197
 - expected ARGB:    0.533333  0.395699  0.542857  0.667029
   min acceptable:        133        98       135       167
   got:                   223       165        62       152   (pixel: 0xdfa53e98)
   max acceptable:        139       104       142       174
----------- Test 545 failed ----------
   operator:         PIXMAN_OP_HARD_LIGHT (unified alpha)
   src format:       r5g6b5
   mask format:      a4r4g4b4
   dest format:      a8r8g8b8
 - source ARGB:      1.000000  0.258065  0.238095  1.000000   (pixel: 0x41ff)
                            0         8        15        31
 - mask ARGB:        0.000000  0.800000  1.000000  1.000000   (pixel: 0x0cff)
                            0        12        15        15
 - dest ARGB:        0.000000  0.882353  1.000000  0.000000   (pixel: 0x00e1ff00)
                            0       225       255         0
 - expected ARGB:    0.000000  0.882353  1.000000  0.000000
   min acceptable:         -2       222       252        -2
   got:                   255       182       180       255   (pixel: 0xffb6b4ff)
   max acceptable:          3       229       258         3
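The `expected ARGB` rows for the PIXMAN_OP_EXCLUSION failures can be recomputed by hand. The C++ sketch below assumes the standard separable PDF blend composition: result alpha is `sa + da - sa*da`, and each colour channel is `(1 - sa)*d + (1 - da)*s + B(s, d)` with `B = s*da + d*sa - 2*s*d`. Under that assumption it reproduces the expected values printed for Tests 526 and 529, so the `got` values are what is off, not the expectations. The helper names are hypothetical. The range check only mirrors how the log reads: a test fails when any channel of `got` falls outside `[min acceptable, max acceptable]`.

```cpp
#include <cassert>
#include <cmath>

// A color with channels in [0, 1], in the log's A, R, G, B order.
struct Argb {
    double c[4];
};

// Separable exclusion blend term on premultiplied values (assumed formula).
double blend_exclusion(double sa, double s, double da, double d) {
    return s * da + d * sa - 2.0 * s * d;
}

// Expected PIXMAN_OP_EXCLUSION result without a mask, assuming the standard
// separable-blend composition:
//   alpha  = sa + da - sa * da
//   colour = (1 - sa) * d + (1 - da) * s + B(s, d)
Argb expected_exclusion(const Argb& src, const Argb& dst) {
    const double sa = src.c[0];
    const double da = dst.c[0];
    assert(sa >= 0.0 && sa <= 1.0 && da >= 0.0 && da <= 1.0);
    Argb out{};
    out.c[0] = sa + da - sa * da;
    for (int i = 1; i < 4; ++i) {
        const double s = src.c[i];
        const double d = dst.c[i];
        out.c[i] = (1.0 - sa) * d + (1.0 - da) * s + blend_exclusion(sa, s, da, d);
    }
    return out;
}

// True when two values agree to the 6 decimals printed in the log.
bool approx_equal(double a, double b) { return std::fabs(a - b) < 1e-5; }

// A test passes only if every integer channel of `got` lies in [lo, hi].
bool within_tolerance(const int got[4], const int lo[4], const int hi[4]) {
    for (int i = 0; i < 4; ++i) {
        if (got[i] < lo[i] || got[i] > hi[i]) return false;
    }
    return true;
}
```

In Test 526, for instance, the expected alpha is 0.266667, which is 68 in 8 bits and lies inside [64, 71]. The failing build produced 118.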

It does not help either, how can I fix the issue if I'm unable to reproduce it?

@ms178

ms178 commented May 18, 2026


> It does not help either, how can I fix the issue if I'm unable to reproduce it?

Maybe I should have been more explicit from the start: I suggest compiling this package as a macro-level test case for reproducing the issues with your changes. The workflow is easy and fast on CachyOS (or any other Arch Linux-based distro). The linked PKGBUILD implements a sophisticated PGO build with a built-in training workload.

This is my workflow:

  1. I put my custom compiler into PATH on the console. From my home directory I run: export PATH="/home/marcus/toolchain/llvm/stage1/bin:$PATH"

  2. I have Clang configured after the default GCC entries in /etc/makepkg.conf. I reproduced the issue with these flags (which work fine with the latest revert applied):

export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CPPFLAGS="-D_FORTIFY_SOURCE=0"
export CFLAGS="-O3 -march=native -mtune=native -mllvm -inline-threshold=1500 -mllvm -always-rename-promoted-locals=false -mllvm -extra-vectorizer-passes -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -falign-functions=32 -funroll-loops -fno-semantic-interposition -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -mprefer-vector-width=256 -flto -fwhole-program-vtables -fsplit-lto-unit -mllvm -adce-remove-loops -mllvm -enable-ext-tsp-block-placement=1 -mllvm -enable-gvn-hoist -mllvm -enable-dfa-jump-thread -Wno-error -fdata-sections -ffunction-sections -fsplit-machine-functions -fno-unique-section-names -fno-plt -fgnuc-version=16.1.1 -mtls-dialect=gnu2 -w"
export CXXFLAGS="${CFLAGS} -Wp,-U_GLIBCXX_ASSERTIONS"
export LDFLAGS="-Wl,--lto-CGO3 -Wl,--lto-whole-program-visibility -Wl,--gc-sections -Wl,--icf=all -Wl,--lto-O3,-O3,-Bsymbolic-functions,--as-needed -fcf-protection=none -mharden-sls=none -Wl,-mllvm -Wl,-extra-vectorizer-passes -Wl,-mllvm -Wl,-slp-vectorize-hor-store -Wl,-mllvm -Wl,-enable-loopinterchange -Wl,-mllvm -Wl,-enable-loop-distribute -Wl,-mllvm -Wl,-enable-unroll-and-jam -Wl,-mllvm -Wl,-enable-loop-flatten -Wl,-mllvm -Wl,-unroll-runtime-multi-exit -Wl,-mllvm -Wl,-aggressive-ext-opt -Wl,-mllvm -Wl,-enable-interleaved-mem-accesses -Wl,-mllvm -Wl,-enable-masked-interleaved-mem-accesses -march=native -flto -fwhole-program-vtables -fuse-ld=lld -Wl,-zmax-page-size=0x200000 -Wl,-mllvm -Wl,-adce-remove-loops -Wl,-mllvm -Wl,-enable-ext-tsp-block-placement=1 -Wl,-mllvm -Wl,-enable-gvn-hoist=1 -Wl,-mllvm -Wl,-enable-dfa-jump-thread=1 -Wl,-z,now -Wl,-z,relro -Wl,-z,pack-relative-relocs -Wl,--hash-style=gnu -Wl,--undefined-version"
export CCLDFLAGS="$LDFLAGS"
export CXXLDFLAGS="$LDFLAGS"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -D__FMA__=1"
  3. Copy the PKGBUILD and the avx2 patch from my repo into the same local directory (I use Downloads/pixman-git).

  4. From the console, change into that directory (containing the PKGBUILD and the avx2 patch) and start the build with this command: makepkg --cleanbuild --skipchecksums --skippgpcheck -si

makepkg should then fetch the relevant source files and install all needed dependencies first, and then run the build and the PGO training workload automatically.

  5. Observe the tests that are executed during the PGO training phase. Ideally, these should PASS.

alexey-bataev added a commit that referenced this pull request May 18, 2026
… buildvector-only

In calculateTreeCostAndTrimNonProfitable, the subtree trim loop returns
Invalid when trimming node Idx==1 under an InsertElement root would
leave only a buildvector, to avoid infinite vectorization attempts.
This is too aggressive when the original untrimmed tree is already
profitable (Cost < -SLPCostThreshold). In that case, undo any partial
trims and return the original cost instead of rejecting the tree.

Original Pull Request: #197763

Recommit after unrelated revert in #198265

Reviewers: 

Pull Request: #198336
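The fix described in this commit message amounts to a small decision rule. The C++ below is a heavily simplified, hypothetical model of it. `Node`, `Tree` and `cost_and_trim` are not SLPVectorizer names. The rule for which nodes get trimmed (any node whose own cost is positive) is an assumption. `std::nullopt` stands in for returning `Invalid`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, heavily simplified model; none of these names exist in
// SLPVectorizer.cpp.
struct Node {
    int cost;              // cost contribution of this node (negative = saving)
    bool trimmed = false;  // node dropped from the vectorized tree
};

struct Tree {
    std::vector<Node> nodes;  // nodes[0] is the root
    bool root_is_insert_element;
};

int total_cost(const Tree& t) {
    int sum = 0;
    for (const Node& n : t.nodes)
        if (!n.trimmed) sum += n.cost;
    return sum;
}

// Returns the cost to use, or std::nullopt where the real code returns Invalid.
std::optional<int> cost_and_trim(Tree& t, int slp_cost_threshold) {
    assert(!t.nodes.empty());
    const int original_cost = total_cost(t);
    const bool originally_profitable = original_cost < -slp_cost_threshold;
    // Walk from the last node back towards the root, trimming costly nodes
    // (assumed policy: any node whose own cost is positive).
    for (std::size_t idx = t.nodes.size() - 1; idx >= 1; --idx) {
        if (t.nodes[idx].cost <= 0) continue;
        if (idx == 1 && t.root_is_insert_element) {
            // Trimming node 1 would leave only the buildvector at the root.
            if (originally_profitable) {
                for (Node& n : t.nodes) n.trimmed = false;  // undo partial trims
                return original_cost;                       // keep the original tree
            }
            return std::nullopt;  // reject, avoiding endless re-attempts
        }
        t.nodes[idx].trimmed = true;
    }
    return total_cost(t);
}
```

In this model, the behaviour before the fix corresponds to always taking the `std::nullopt` branch at node 1. A tree like the first test case (costs 0, 3, 4, -20) would then have been rejected even though the untrimmed tree was profitable.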
cpullvm-upstream-sync Bot pushed a commit to navaneethshan/cpullvm-toolchain-1 that referenced this pull request May 18, 2026
…d reduce to buildvector-only

llvm-sync Bot pushed a commit to arm/arm-toolchain that referenced this pull request May 18, 2026
…d reduce to buildvector-only

llvm-upstreamsync Bot pushed a commit to qualcomm/cpullvm-toolchain that referenced this pull request May 18, 2026
…d reduce to buildvector-only

pedroMVicente pushed a commit to pedroMVicente/llvm-project that referenced this pull request May 19, 2026
It causes assertions failure such as this one. See discussion on the PR.

  Constants.cpp:2802:
  static Constant *llvm::ConstantExpr::getInsertElement(Constant *, Constant *, Constant *, Type *): Assertion `Val->getType()->isVectorTy() &&
  "Tried to create insertelement operation on non-vector type!"' failed.

> Allow SLP to combine across lanes calls that return a literal struct
> (llvm.sincos, llvm.*.with.overflow, llvm.frexp, ...) into a single
> call returning a struct of vectors, by widening {T, T, ...} to
> {<VF x T>, ...} via VectorTypeUtils and emitting extractvalue +
> extractelement for external uses.
>
> Original Pull Request: llvm#195521
>
> Original Pull Request2: llvm#196756
>
> Recommit after revert llvm#197969
>
> Added check for valid vectorizable type.
>
> Reviewers:
>
> Pull Request: llvm#197994

This reverts commit 1c5e395
and the follow-up or dependent commits landed since:

aa2f124 [SLP] Enable full non-power-of-2 vectorization by default
6e8b6ef [SLP][REVEC] Fix crash on scalable vector types with -slp-revec
8156fce [SLP] Prefer VF-matching scalar-set match in gather-shuffle lookup
97ce93a [SLP]Consider non-profitable trees with buildvector of struct-returning instructions
f0adfab [SLP] Preserve profitable trees when subtree trimming would reduce to buildvector-only
pedroMVicente pushed a commit to pedroMVicente/llvm-project that referenced this pull request May 19, 2026
… buildvector-only

@alexey-bataev

Member

I'm unable to reproduce it; please provide a reproducer.

alexey-bataev added a commit that referenced this pull request May 24, 2026
Allow SLP to combine across lanes calls that return a literal struct
(llvm.sincos, llvm.*.with.overflow, llvm.frexp, ...) into a single
call returning a struct of vectors, by widening {T, T, ...} to
{<VF x T>, ...} via VectorTypeUtils and emitting extractvalue +
extractelement for external uses.

Original Pull Request: #195521

Original Pull Request2: #196756

Recommit after revert #198265 (comment)

Added check for valid vectorizable type, small corner cases fixes

Reviewers: 

Pull Request: #199433
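In scalar C++ terms, the transformation described in this commit message replaces per-lane calls that each return a struct with one call that returns a struct of vectors. External scalar users then pick a field and a lane. This is a model only and uses no LLVM APIs. `std::array` stands in for a fixed-width vector, and `std::sin`/`std::cos` stand in for `llvm.sincos`. `SinCos`, `sincos_vector` and `extract_cos` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar form: one call per lane, each returning a literal struct {T, T}
// (think of llvm.sincos on a single float).
struct SinCos {
    float sin;
    float cos;
};

SinCos sincos_scalar(float x) { return {std::sin(x), std::cos(x)}; }

// Widened form: one call returning a struct of vectors {<VF x T>, <VF x T>}.
template <std::size_t VF>
struct SinCosVec {
    std::array<float, VF> sin;
    std::array<float, VF> cos;
};

template <std::size_t VF>
SinCosVec<VF> sincos_vector(const std::array<float, VF>& x) {
    SinCosVec<VF> r{};
    for (std::size_t lane = 0; lane < VF; ++lane) {
        r.sin[lane] = std::sin(x[lane]);
        r.cos[lane] = std::cos(x[lane]);
    }
    return r;
}

// An external scalar user of lane `lane`'s cos result: select the struct
// field first (like extractvalue), then the lane (like extractelement).
template <std::size_t VF>
float extract_cos(const SinCosVec<VF>& v, std::size_t lane) {
    assert(lane < VF);
    const std::array<float, VF>& field = v.cos;  // ~ extractvalue {..}, 1
    return field[lane];                          // ~ extractelement <VF x float>
}
```

The two-step access in `extract_cos` corresponds to the extractvalue + extractelement pair that the commit message describes for external uses.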
llvm-upstreamsync Bot pushed a commit to qualcomm/cpullvm-toolchain that referenced this pull request May 24, 2026
llvm-sync Bot pushed a commit to arm/arm-toolchain that referenced this pull request May 24, 2026