Optimizations by mutalibmohammed · Pull Request #191 · UoB-HPC/SimEng

mutalibmohammed · 2021-09-16T11:44:50Z

The main theme of this pull request is optimizing memory allocations/deallocations. Profiling showed that, on average, 10-20 memory allocations/deallocations occur per cycle. By reducing the number of allocations, I hope to reduce cycle time.

New language features and smaller changes were preferred over big structural changes to improve performance.

6bf6aad ... 5410557 primarily change function parameters to accept const references. A const reference binds to both lvalues and rvalues, so the argument is not copied.
Member initializer lists were also added or improved wherever possible. This avoids default-constructing a member and then assigning to it in the constructor body.

Several of these changes were suggested by the static analyzer Cppcheck.
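As a rough illustration of the two patterns described above, here is a minimal sketch. The class `PortGroup` and the function `countPorts` are hypothetical and are not taken from the SimEng sources.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class, used only to illustrate the pattern.
class PortGroup {
 public:
  // Const reference parameters mean the caller's objects are not copied just
  // to make the call. The member initializer list then constructs each member
  // directly from its argument, instead of default-constructing it first and
  // assigning to it in the constructor body.
  PortGroup(const std::string& name, const std::vector<int>& ports)
      : name_(name), ports_(ports) {}

  const std::string& name() const { return name_; }
  std::size_t size() const { return ports_.size(); }

 private:
  std::string name_;
  std::vector<int> ports_;
};

// A const reference parameter binds to lvalues and to temporaries alike.
inline std::size_t countPorts(const std::vector<int>& ports) {
  return ports.size();
}
```

Both `countPorts(ports)` and `countPorts(std::vector<int>{4, 5})` compile against the same signature, and neither call copies the vector.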

b0c1773 uses the C++17 method try_emplace to construct an element in place, avoiding a copy. Its syntax for passing constructor arguments is also simpler than the pre-C++17 approach, which requires std::piecewise_construct.
16f2758 changes getSupportedPorts to return a reference to a vector. The aarch64::Instruction class has a member vector that stores the ports, and it is alive for as long as the instruction instance is, hence we return a reference to it to avoid copying.
a0aebb1 adds a new CMake flag SIMENG_OPTIMIZE. This flag turns on Link-Time Optimizations allowing the compiler to optimize across translation units. Additionally, on x86 systems, it sets the -march and -mtune compiler flags.
856dc96 overloads setMemoryAddresses to accept rvalue references. This overload can reuse the resources of its parameter. The vectors created in Instruction::generateAddresses are destroyed at the end of the function, which presents an opportunity to reuse their resources.
bea1f88 creates a new member flushed_ to avoid allocating a map every flush. The sets are still allocated/deallocated every flush; using a stack allocator here could be beneficial.
f4a1dbb and ac83072 add a second parameter, capacity, to the RegisterValue ctor that takes an array by reference. In aarch64, FP, SVE and NEON instructions use the same register bank of 256-byte registers. The updated ctor makes it easier to create a RegisterValue of the correct size.

Instruction::execute was updated to use the new ctor. This avoids data copying and an extra ctor call at the end of the function when zero extending.
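To make the try_emplace point from b0c1773 concrete, the sketch below contrasts the two insertion styles. `DecodedEntry`, `DecodeCache` and both helper functions are hypothetical names chosen for illustration; the real cache types in SimEng may differ.

```cpp
#include <cstdint>
#include <string>
#include <tuple>
#include <unordered_map>
#include <utility>

// Hypothetical cached value with a two-argument constructor.
struct DecodedEntry {
  DecodedEntry(std::string mnemonic, int latency)
      : mnemonic(std::move(mnemonic)), latency(latency) {}
  std::string mnemonic;
  int latency;
};

using DecodeCache = std::unordered_map<std::uint32_t, DecodedEntry>;

// Pre-C++17: in-place construction needs std::piecewise_construct plus one
// tuple for the key arguments and one for the value arguments.
inline bool insertPiecewise(DecodeCache& cache, std::uint32_t key) {
  return cache
      .emplace(std::piecewise_construct, std::forward_as_tuple(key),
               std::forward_as_tuple("add", 1))
      .second;
}

// C++17: try_emplace forwards the trailing arguments to the value's
// constructor, and does nothing if the key is already present.
inline bool insertTryEmplace(DecodeCache& cache, std::uint32_t key) {
  return cache.try_emplace(key, "add", 1).second;
}
```

Both helpers return whether an insertion took place, so calling `insertTryEmplace` twice with the same key returns `true` and then `false`.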

The RegisterValue class along with Instruction class are a major source of memory allocations. The next few commits focus on reducing it.

b40fde0 experiments with C++17's std::unsynchronized_pool_resource, a general-purpose, non-thread-safe memory pool implementation. However, it was abandoned as it was not implemented in LLVM/Clang.
a068449 experiments with using raw pointers instead of shared_ptr. This required implementing copy and move constructors. This change resulted in a performance regression and was reverted.
ee4ec4d and cbd0d39 implement a simple memory pool that is based on a free list. Each node of the free list is stored in the free chunks of memory. The memory pool grows by a factor of 2 when exhausted.
ce708b0 changes the predict and update functions in BranchPredictor to accept a reference to shared_ptr. This is to avoid the cost of atomic increment/decrement. The lifetime of the ptr is expected to outlast predict and update, hence it is safe.
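The free-list pool described for ee4ec4d and cbd0d39 could look roughly like the sketch below. This is an illustration written from the description only, not the code in those commits: the class name `FreeListPool`, its interface, and the choice to double capacity by adding a block as large as the current total are all assumptions. It also only guarantees pointer-sized alignment for chunks.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Illustrative pool of fixed-size chunks; all names are assumptions.
class FreeListPool {
 public:
  FreeListPool(std::size_t chunkSize, std::size_t initialChunks)
      : chunkSize_(roundToNode(chunkSize)),
        nextBlockChunks_(initialChunks == 0 ? 1 : initialChunks) {}

  // Pops a chunk off the free list, growing the pool first if it is empty.
  void* allocate() {
    if (head_ == nullptr) grow();
    Node* node = head_;
    head_ = node->next;
    return node;
  }

  // Pushes a chunk back; the list node is written into the free chunk itself.
  void deallocate(void* chunk) { head_ = new (chunk) Node{head_}; }

  std::size_t capacity() const { return capacity_; }

 private:
  struct Node {
    Node* next;
  };

  // Every chunk must be able to hold a Node, so round up to a multiple of it.
  static std::size_t roundToNode(std::size_t size) {
    const std::size_t unit = sizeof(Node);
    if (size < unit) size = unit;
    return (size + unit - 1) / unit * unit;
  }

  // Adds one block of chunks and threads each chunk onto the free list.
  void grow() {
    const std::size_t count = nextBlockChunks_;
    blocks_.push_back(std::make_unique<unsigned char[]>(count * chunkSize_));
    unsigned char* base = blocks_.back().get();
    for (std::size_t i = 0; i < count; ++i) deallocate(base + i * chunkSize_);
    capacity_ += count;
    nextBlockChunks_ = capacity_;  // the next block doubles the total capacity
  }

  std::size_t chunkSize_;
  std::size_t nextBlockChunks_;
  std::size_t capacity_ = 0;
  Node* head_ = nullptr;
  std::vector<std::unique_ptr<unsigned char[]>> blocks_;
};
```

Because a freed chunk is pushed onto the head of the list, the next `allocate()` hands the same chunk back without touching the system allocator.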

These optimizations were suggested by cppcheck. The optimizations consist of using const references in function arguments to avoid invoking the copy constructor, and using initialization lists.

This change avoids a copy of the vector being returned.
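A minimal sketch of returning a member vector by const reference, as this commit message describes. `PortedInstruction` is a hypothetical stand-in; the exact signature of getSupportedPorts in SimEng is assumed here rather than quoted.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an instruction that owns its list of ports.
class PortedInstruction {
 public:
  explicit PortedInstruction(std::vector<std::uint8_t> ports)
      : supportedPorts_(std::move(ports)) {}

  // Returns a reference to the member instead of a copy. The reference is
  // only valid for as long as this object is alive.
  const std::vector<std::uint8_t>& getSupportedPorts() const {
    return supportedPorts_;
  }

 private:
  std::vector<std::uint8_t> supportedPorts_;
};
```

Callers that bind the result to a `const std::vector<std::uint8_t>&` see the member itself; callers that need an independent copy can still make one explicitly.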

Instead of creating a new vector every cycle to count the number of dispatched instructions, store the vector as a class member and reset it every cycle.
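The idea of keeping the counting vector as a member and resetting it each cycle can be sketched as follows. `DispatchCounter` and `tick` are hypothetical names for illustration and do not reflect the actual layout of DispatchIssueUnit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical unit that counts dispatches per port on every cycle.
class DispatchCounter {
 public:
  explicit DispatchCounter(std::size_t numPorts) : dispatches_(numPorts, 0u) {}

  // Called once per simulated cycle. std::fill overwrites the existing
  // elements in place, so no allocation happens here.
  void tick(const std::vector<std::size_t>& portsUsed) {
    std::fill(dispatches_.begin(), dispatches_.end(), 0u);
    for (std::size_t port : portsUsed) {
      if (port < dispatches_.size()) ++dispatches_[port];
    }
  }

  const std::vector<unsigned>& dispatches() const { return dispatches_; }

 private:
  std::vector<unsigned> dispatches_;  // allocated once, reused every cycle
};
```

The vector's storage is allocated in the constructor and stays at the same address across calls to `tick`.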

This option turns on link time optimization and sets -march and -mtune flags for all targets.

A map is used instead of a vector as it has amortised constant time erase complexity.

The ctor that accepts a reference to an array is updated to accept capacity as a second parameter. This allows us to create a RegisterValue from an array without the size of the RegisterValue being bound to the array size. Type traits were introduced in the first ctor to stop it from accepting pointers as its first parameter. If pointers were allowed, it would clash with the array ctor in overload resolution.
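The overload arrangement described above can be sketched as follows. `RegValue` is a simplified, hypothetical class that stores its bytes in a std::vector; it is not SimEng's RegisterValue, and the exact trait expression used in the commit is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical, simplified register value.
class RegValue {
 public:
  // Value ctor: disabled for pointer types, so an array argument (which would
  // otherwise decay to a pointer) can never select this overload.
  template <class T, std::enable_if_t<!std::is_pointer_v<T>, int> = 0>
  explicit RegValue(T value, std::size_t capacity = sizeof(T))
      : bytes_(capacity, 0) {
    static_assert(std::is_trivially_copyable_v<T>, "raw byte copy required");
    copyIn(&value, sizeof(T));
  }

  // Array ctor: copies the array and leaves the rest of `capacity` zeroed, so
  // the register size is independent of the array size.
  template <class T, std::size_t N>
  RegValue(const T (&array)[N], std::size_t capacity) : bytes_(capacity, 0) {
    static_assert(std::is_trivially_copyable_v<T>, "raw byte copy required");
    copyIn(array, sizeof(T) * N);
  }

  std::size_t size() const { return bytes_.size(); }
  std::uint8_t byte(std::size_t index) const { return bytes_[index]; }

 private:
  // Copies at most size() bytes from src into the register storage.
  void copyIn(const void* src, std::size_t length) {
    if (length > bytes_.size()) length = bytes_.size();
    if (length > 0) std::memcpy(bytes_.data(), src, length);
  }

  std::vector<std::uint8_t> bytes_;
};
```

With this arrangement, `RegValue(data, 256)` for a four-byte array yields a 256-byte value whose remaining bytes are zero, so no separate zero-extension step is needed afterwards.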

Previously, the registers were extended at the end of the execute function. Now, they are created with the right size in the first place.

We are abandoning the C++17 memory pool because it is not yet supported by LLVM/Clang. I hypothesize that using raw pointers could give a performance boost as we do not use any atomics. However, this means that we need to implement our own copy/move constructors. I have used the copy-and-swap idiom to implement them.

Using the C++17 memory pool and raw pointers was unsuccessful in providing a performance boost, hence we are reverting to the old state.

The purpose of implementing a memory pool is to reduce the cost of memory allocations by reusing memory. It was designed to be used primarily for the RegisterValue class. Code from Stanford's Bit Twiddling Hacks is used to implement the roundUp function.
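For reference, the "round up to the next highest power of 2" technique from Bit Twiddling Hacks looks like the sketch below for 32-bit values. The name roundUp comes from the commit message; the signature and the 32-bit width are assumptions, and the version in the commit may differ.

```cpp
#include <cstdint>

// Rounds v up to the next power of two; values that are already a power of
// two are returned unchanged. An input of 0 yields 0.
constexpr std::uint32_t roundUp(std::uint32_t v) {
  v--;           // so that exact powers of two map to themselves
  v |= v >> 1;   // smear the highest set bit into every lower position
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  v++;           // all-ones below the top bit, plus one, is a power of two
  return v;
}
```

For example, `roundUp(17)` gives 32 and `roundUp(16)` gives 16, which lets a pool map arbitrary request sizes onto a small set of power-of-two size classes.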

predict() and update() take a reference to the shared_ptr instead of making a copy. This goes against the intended use of shared_ptr; however, we deviate from this rule to improve performance (avoiding atomic inc/dec).
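The effect on the reference count can be seen in a small sketch. `Uop` and the two functions are hypothetical; they only show what passing a shared_ptr by value versus by const reference does, not the BranchPredictor interface itself.

```cpp
#include <memory>

// Hypothetical payload type.
struct Uop {
  int address = 0;
};

// By value: the parameter is a copy of the shared_ptr, so the reference count
// is atomically incremented on entry and decremented on exit.
inline long countByValue(std::shared_ptr<Uop> uop) { return uop.use_count(); }

// By const reference: no copy is made and the reference count is untouched.
// The caller must keep its shared_ptr alive for the duration of the call.
inline long countByReference(const std::shared_ptr<Uop>& uop) {
  return uop.use_count();
}
```

With a single owner, `countByValue` observes a count of 2 while `countByReference` observes 1.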

jj16791 · 2021-09-16T13:14:12Z

#rerun tests

jj16791 · 2021-09-16T13:14:14Z

#rerun tests

FinnWilkinson · 2021-09-16T15:31:47Z

+// Temporary; until execute has been verified to work correctly.
+#ifndef NDEBUG
+#include <iostream>
+#endif
+


Still needed?

I think so. I have only tested it on my end, so it would be better if the team could test as well.

jj16791 · 2021-09-24T12:38:29Z

#rerun tests

jj16791 · 2021-09-24T12:54:37Z

#rerun tests

jj16791 · 2021-09-25T11:41:39Z

#rerun tests

jj16791 · 2021-09-25T12:01:07Z

#rerun tests

seunghun1ee

Looks good to me.

FinnWilkinson

All looks good

mutalibmohammed added 21 commits September 16, 2021 11:57

Optimize using basic language features

6bf6aad

These optimizations were suggested by cppcheck. The optimizations consist of using const references in function arguments to avoid invoking the copy constructor, and using initialization lists.

Use std library method to insert in unordered map

b0c1773

Change PortAllocator::allocate to accept reference

5410557

Reuse vectors when allocating instruction attribute

2483605

Change getSupportedPorts to return a reference

16f2758

This change avoids a copy of the vector being returned.

Store dispatches vector as a class member in DispatchIssueUnit

c280cbf

Instead of creating a new vector every cycle to count the number of dispatched instructions, store the vector as a class member and reset it every cycle.

Use array to count number of requests in LSQ

d6e260f

Optimize DispatchIssueUnit::Issue()

6d76648

Add CMake option SIMENG_OPTIMIZE

a0aebb1

This option turns on link time optimization and sets -march and -mtune flags for all targets.

Use move semantics when setting memory addresses

856dc96
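A minimal sketch of an rvalue-reference overload of the kind this commit describes. `MemoryTarget` and `LoadStoreInstr` are hypothetical types; only the function name setMemoryAddresses is taken from the pull request.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical address/size pair.
struct MemoryTarget {
  std::uint64_t address;
  std::uint8_t size;
};

// Hypothetical instruction holding a list of memory targets.
class LoadStoreInstr {
 public:
  // Copying overload: the caller keeps its own vector.
  void setMemoryAddresses(const std::vector<MemoryTarget>& addresses) {
    addresses_ = addresses;
  }

  // Moving overload: takes over the buffer of a vector the caller no longer
  // needs, instead of allocating a new one and copying the elements.
  void setMemoryAddresses(std::vector<MemoryTarget>&& addresses) {
    addresses_ = std::move(addresses);
  }

  const std::vector<MemoryTarget>& addresses() const { return addresses_; }

 private:
  std::vector<MemoryTarget> addresses_;
};
```

A caller that builds a temporary vector and passes it with `std::move` selects the second overload, so the temporary's heap buffer ends up owned by the instruction.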

Avoid temporary object when inserting in decodeCache

66e695c

Use shift instead of pow() to calculate power of 2

4c3d8d3
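As an illustration of this change, the sketch below computes a power of two both ways. The function names are hypothetical, and the shift version assumes the exponent is less than 64.

```cpp
#include <cmath>
#include <cstdint>

// Floating-point route: a library call plus a conversion back to an integer.
inline std::uint64_t powerOfTwoPow(unsigned exponent) {
  return static_cast<std::uint64_t>(std::pow(2.0, exponent));
}

// Integer route: a single shift. Requires exponent < 64.
constexpr std::uint64_t powerOfTwoShift(unsigned exponent) {
  return std::uint64_t{1} << exponent;
}
```

Both functions agree for small exponents such as 10 or 20, but the shift stays entirely in integer arithmetic.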

Create a map to store flushed instructions

bea1f88

A map is used instead of a vector as it has amortised constant time erase complexity.

Update instruction_execute to use the new array ctor

ac83072

Previously, the registers were extended at the end of the execute function. Now, they are created with the right size in the first place.

Use C++17 memory pool to allocate RegisterValue

b40fde0

Revert back to using shared_ptr in RegisterValue

cdd1ca7

Using the C++17 memory pool and raw pointers was unsuccessful in providing a performance boost, hence we are reverting to the old state.

Adapt RegisterValue class to use memory pool

cbd0d39

Update BranchPredictor to accept reference to shared_ptr

ce708b0

predict() and update() take a reference to the shared_ptr instead of making a copy. This goes against the intended use of shared_ptr; however, we deviate from this rule to improve performance (avoiding atomic inc/dec).

mutalibmohammed requested review from FinnWilkinson, jj16791 and seunghun1ee September 16, 2021 12:27

mutalibmohammed added the performance Performance optimisation label Sep 16, 2021