Conversation
These optimizations were suggested by cppcheck. The optimizations consist of using const reference in function arguments to avoid invoking the copy constructor and using intialization lists.
This change avoids a copy of the vector being returned.
Instead of creating a new vector every cycle to count the number of dispatched instructions, store the vector as a class member and reset it every cycle.
This option turns on link time optimization and sets -march and -mtune flags for all targets.
A map is used instead of a vector as it has amortised constant time erase complexity.
The ctor that accepts a reference to an array is updated to accept capacity as a second parameter. This allows us to create RegisterValue from arrays without the size of RegisterValue being bound to the array size. Type traits were introduced in the first ctor to stop it from accepting pointers as it's first parameter. If pointers were allowed, it would clash with array ctor in overload resolution.
Previously, the registers were extended at the end of the execute function. Now, they are created with the right size in the first place.
We are abandoning C++17 memory pool because it is not yet supported by LLVM/Clang. I hypothesize that using raw pointers could give a performance boost as we do not use any atomics. However, this means that we need to implement our own copy/move constructors. I have used the copy-and-swap idiom to implement them.
Using C++17 memory pool and raw pointers were unsuccesful in providing a performance boost, hence we are reverting back to the old state.
The purpose of implementing a memory pool is to reduce the cost of memory allocations by reusing memory. It was designed to be used primarily for the RegisterValue class. Code from Standford's Bit Twiddling Hacks is used to implement the roundUp function.
predict() and update() take a reference to the shared_ptr instead of making a copy. This goes against the intended use of shared_ptr, however we deviate from this rule to improve performance (avoiding atomic inc/dec).
Contributor
|
#rerun tests |
1 similar comment
Contributor
|
#rerun tests |
Comment on lines
+5
to
+9
| // Temporary; until execute has been verified to work correctly. | ||
| #ifndef NDEBUG | ||
| #include <iostream> | ||
| #endif | ||
|
|
Contributor
Author
There was a problem hiding this comment.
I think so. I have only tested it on my end, so it would be better if the team could test as well.
Contributor
|
#rerun tests |
3 similar comments
Contributor
|
#rerun tests |
Contributor
|
#rerun tests |
Contributor
|
#rerun tests |
jj16791
approved these changes
Sep 30, 2021
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The main theme of this pull request is optimizing memory allocations/deallocations. It was found while profiling that on average, 10-20 memory allocations/deallocations would occur per cycle. By reducing the no. of allocations, I hope to reduce cycle time.
New language features and smaller changes were preferred over big structural changes to improve performance.
6bf6aad ... 5410557 primarily change function parameters to accept const references. This allows them to accept both lvalue and rvalue references; avoiding a copy.
member initializer list were also added/improved wherever possible. This change avoids having to a call the constructor of a member twice.
b0c1773 uses a new C++17 method
try_emplaceto construct an element in place avoiding a copy. It also has better syntax to pass constructor arguments compared to pre C++17 methods that requirestd::piecewise_construct.16f2758 changes
getSupportedPortsto return a reference to a vector. Theaarch64::Instructionclass has a member vector that stores the ports, and it is alive for as long as the instruction instance is, hence we return a reference to it to avoid copying.a0aebb1 adds a new CMake flag
SIMENG_OPTIMIZE. This flag turns on Link-Time Optimizations allowing the compiler to optimize across translation units. Additionally, on x86 systems, it sets the-marchand-mtunecompiler flags.856dc96 overloads
setMemoryAddressesto accept rvalue references. This overload can reuse the resources of its parameter. The vectors created inInstruction::generateAddressesare deleted at the end of function. This presents an opportunity to reuse resources.bea1f88 creates a new member
flushed_to avoid allocating a map every flush. The sets are still allocated/deallocated every flush, perhaps using a stack allocator here could be beneficial.f4a1dbb and ac83072 adds a second parameter
capacityto theRegisterValuector that takes an array as a reference. In aarch64, FP, SVE and NEON instructions use the same register bank of 256 byte registers. The updated ctor makes it easier to createRegisterValueof the correct size.Instruction::executewas updated to use the new ctor. This avoids data copying and an extra ctor call at the end of the function when zero extending.b40fde0 experiments with using C++17's new
std::unsynchronized_pool_resource. It is a general-purpose, non thread-safe memory pool implementation. However, it was abandoned as it was not implemented in LLVM/Clang.a068449 experiments with using raw pointers instead of shared_ptr. This required implementing copy and move constructors. This change resulted in a performance regression and was reverted.
ee4ec4d and cbd0d39 implement a simple memory pool implementation that is based on a free list. Each node of the free list is stored in the free chunks of memory. The memory pool grows by a factor of 2 when exhausted.
ce708b0 changes the
predictandupdatefunctions inBranchPredictorto accept a reference to shared_ptr. This is to avoid the cost of atomic increment/decrement. The lifetime of the ptr is expected to out lastpredictandupdate, hence it is safe.