Skip to content

[ARM] Reduce loop unroll when low overhead branching is available#120065

Merged
stuij merged 1 commit into
llvm:mainfrom
VladiKrapp-Arm:arm_no_deep_unroll_with_lob
Dec 18, 2024
Merged

[ARM] Reduce loop unroll when low overhead branching is available#120065
stuij merged 1 commit into
llvm:mainfrom
VladiKrapp-Arm:arm_no_deep_unroll_with_lob

Conversation

@VladiKrapp-Arm

@VladiKrapp-Arm VladiKrapp-Arm commented Dec 16, 2024

Copy link
Copy Markdown
Contributor

For processors with low overhead branching (LOB), runtime unrolling the innermost loop is often detrimental to performance. In these cases the loop remainder gets unrolled into a series of compare-and-jump blocks, which in deeply nested loops get executed multiple times, negating the benefits of LOB.
This is particularly noticable when the loop trip count of the innermost loop varies within the outer loop, such as in the case of triangular matrix decompositions.

In these cases we will prefer to not unroll the innermost loop, with the intention for it to be executed as a low overhead loop.

@llvmbot

llvmbot commented Dec 16, 2024

Copy link
Copy Markdown
Member

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-arm

Author: Vladi Krapp (VladiKrapp-Arm)

Changes

For processors with low overhead branching (LOB), runtime unrolling the innermost loop is often detrimental to performance. In these cases the loop remainder gets unrolled into a series of compare-and-jump blocks, which in deeply nested loops get executed multiple times, negating the benefits of LOB.
This is particularly noticable when the loop trip count of the innermost loop varies withing the outer loop, such as in the case of triangular matrix decompositions.

In these cases we will prefer to not unroll the innermost loop, with the intention for it to be executed as a low overhead loop.


Full diff: https://github.com/llvm/llvm-project/pull/120065.diff

2 Files Affected:

  • (modified) llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp (+13-1)
  • (modified) llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll (+18-9)
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index 0e29648a7a284f..d336a69718cd36 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -2592,11 +2592,23 @@ void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
       return;
   }
 
+  bool Runtime = true;
+  if (ST->hasLOB()) {
+    if (SE.hasLoopInvariantBackedgeTakenCount(L)) {
+      const auto *BETC = SE.getBackedgeTakenCount(L);
+      auto *Outer = L->getOutermostLoop();
+      if ((L != Outer && Outer != L->getParentLoop()) ||
+          (L != Outer && BETC && !SE.isLoopInvariant(BETC, Outer))) {
+        Runtime = false;
+      }
+    }
+  }
+
   LLVM_DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");
   LLVM_DEBUG(dbgs() << "Default Runtime Unroll Count: " << UnrollCount << "\n");
 
   UP.Partial = true;
-  UP.Runtime = true;
+  UP.Runtime = Runtime;
   UP.UnrollRemainder = true;
   UP.DefaultUnrollRuntimeCount = UnrollCount;
   UP.UnrollAndJam = true;
diff --git a/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll b/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll
index b155f5d31045f9..111bc96b28806a 100644
--- a/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll
+++ b/llvm/test/Transforms/LoopUnroll/ARM/lob-unroll.ll
@@ -1,17 +1,23 @@
+; RUN: opt -mcpu=cortex-m7 -mtriple=thumbv8.1m.main -passes=loop-unroll -S  %s -o - | FileCheck %s --check-prefix=NLOB
 ; RUN: opt -mcpu=cortex-m55 -mtriple=thumbv8.1m.main -passes=loop-unroll -S  %s -o - | FileCheck %s --check-prefix=LOB
 
 ; This test checks behaviour of loop unrolling on processors with low overhead branching available 
 
-; LOB-CHECK-LABEL: for.body{{.*}}.prol
-; LOB-COUNT-1:     fmul fast float 
-; LOB-CHECK-LABEL: for.body{{.*}}.prol.1
-; LOB-COUNT-1:     fmul fast float 
-; LOB-CHECK-LABEL: for.body{{.*}}.prol.2
-; LOB-COUNT-1:     fmul fast float 
-; LOB-CHECK-LABEL: for.body{{.*}}
-; LOB-COUNT-4:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}.prol:
+; NLOB-COUNT-1:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}.prol.1:
+; NLOB-COUNT-1:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}.prol.2:
+; NLOB-COUNT-1:     fmul fast float 
+; NLOB-LABEL: for.body{{.*}}:
+; NLOB-COUNT-4:     fmul fast float 
+; NLOB-NOT:     fmul fast float 
+
+; LOB-LABEL: for.body{{.*}}:
+; LOB:     fmul fast float 
 ; LOB-NOT:     fmul fast float 
 
+
 ; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite)
 define dso_local void @test(i32 noundef %n, ptr nocapture noundef %pA) local_unnamed_addr #0 {
 entry:
@@ -20,7 +26,7 @@ entry:
 
 for.cond.loopexit:                                ; preds = %for.cond6.for.cond.cleanup8_crit_edge.us, %for.body
   %exitcond49.not = icmp eq i32 %add, %n
-  br i1 %exitcond49.not, label %for.cond.cleanup, label %for.body
+  br i1 %exitcond49.not, label %for.cond.cleanup, label %for.body, !llvm.loop !0
 
 for.cond.cleanup:                                 ; preds = %for.cond.loopexit, %entry
   ret void
@@ -61,3 +67,6 @@ for.cond6.for.cond.cleanup8_crit_edge.us:         ; preds = %for.body9.us
   br i1 %exitcond48.not, label %for.cond.loopexit, label %for.cond6.preheader.us
 }
 
+!0 = distinct !{!0, !1, !2}
+!1 = !{!"llvm.loop.mustprogress"}
+!2 = !{!"llvm.loop.unroll.disable"}

Comment thread llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@davemgreen davemgreen requested a review from stuij December 16, 2024 18:28
For processors with low overhead branching (LOB), runtime unrolling the
innermost loop is often detrimental to performance.
In these cases the loop remainder gets unrolled into a series of
compare-and-jump blocks, which in deeply nested loops get executed multiple
times, negating the benefits of LOB.
This is particularly noticable when the loop trip count of the innermost
loop varies within the outer loop, such as in the case of triangular matrix
decompositions.

In these cases we will prefer to not unroll the innermost loop, with the
intention for it to be executed as a low overhead loop.
@VladiKrapp-Arm VladiKrapp-Arm force-pushed the arm_no_deep_unroll_with_lob branch from 72d5967 to eaaa7fd Compare December 17, 2024 09:27

@davemgreen davemgreen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants