HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. #4371
szetszwo merged 2 commits into apache:master
Hi @Xushaohong @szetszwo could you take a look?
Hi @prashantpogde @hemantk-12 @GeorgeJahad I see that there is already an effort to introduce incremental checkpoints in the OM snapshot process. Our cluster is currently hitting an issue in which a slow OM follower has to download a large OM metadata store because the leader's log has been purged. This merge request seeks to work around the issue by allowing the Ratis log purge behavior to be disabled. However, this carries a risk: the OM leader's disk space could fill up quickly if the follower is down or very slow. Therefore I think the long-term solution would be to integrate incremental checkpoints. May I know what the progress on that feature is? We might be interested in integrating it in our cluster.
szetszwo left a comment
+1 the change looks good.
@szetszwo Thank you for the review and merge.
* master: (262 commits)
  HDDS-8153. Integrate ContainerBalancer with MoveManager (apache#4391)
  HDDS-8090. When getBlock from a datanode fails, retry other datanodes. (apache#4357)
  HDDS-8163 Use try-with-resources to ensure close rockdb connection in SstFilteringService (apache#4402)
  HDDS-8065. Provide GNU long options (apache#4394)
  HDDS-7930. [addendum] input stream does not refresh expired block token.
  HDDS-7930. input stream does not refresh expired block token. (apache#4378)
  HDDS-7740. [Snapshot] Implement SnapshotDeletingService (apache#4244)
  HDDS-8076. Use container cache in Key listing API. (apache#4346)
  HDDS-8091. [addendum] Generate list of config tags from ConfigTag enum - Hadoop 3.1 compatibility fix (apache#4374)
  HDDS-8144. TestDefaultCertificateClient#testTimeBeforeExpiryGracePeriod fails as we approach DST. (apache#4382)
  HDDS-8151. Support fine grained lifetime for root CA certificate (apache#4386)
  HDDS-8150. RpcClientTest and ConfigurationSourceTest not run due to naming convention (apache#4388)
  HDDS-8131. Add Configuration for OM Ratis Log Purge Tuning Parameters. (apache#4371)
  HDDS-8133. Create ozone sh key checksum command (apache#4375)
  HDDS-8142. Check if no entries in Block DB for a container on container delete (apache#4379)
  HDDS-8118. Fail container delete on non empty chunks dir (apache#4367)
  HDDS-8028. JNI for RocksDB SST Dump tool (apache#4315)
  HDDS-8129. ContainerStateMachine allows two different tasks with the same container id running in parallel. (apache#4370)
  HDDS-8119. Remove loosely related AutoCloseable from SendContainerOutputStream (apache#4368)
  close db connection (apache#4366)
  ...
What changes were proposed in this pull request?
Currently, Ozone Manager enables raft.server.log.purge.upto.snapshot.index by default. However, for an OM cluster with a large metadata store, the OM leader may purge its Ratis logs before a slow follower has replicated them. The follower then has to download the whole metadata store from the OM leader, which can be problematic if the leader's metadata store is very large.
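To make the failure mode concrete, here is a toy model of the purge decision. It is a sketch under simplifying assumptions and not Ratis's actual implementation: the class, the method names, and the exact arithmetic are hypothetical, and the role given to raft.server.log.purge.preservation.log.num (keep that many entries behind the snapshot index) is my reading of the setting rather than something stated in this PR.

```java
// Toy model only: hypothetical names, simplified rules, not Ratis code.
public class PurgeSketch {

    /**
     * Highest log index (inclusive) that the leader purges in this model.
     * With purge-up-to-snapshot enabled, followers are ignored and only the
     * snapshot index (minus an assumed preservation window) matters.
     * With it disabled, the leader never purges past the slowest follower.
     */
    static long purgeIndex(long snapshotIndex, long slowestFollowerMatchIndex,
                           boolean purgeUptoSnapshotIndex, long preservationLogNum) {
        if (purgeUptoSnapshotIndex) {
            return Math.max(-1L, snapshotIndex - preservationLogNum);
        }
        return Math.min(snapshotIndex, slowestFollowerMatchIndex);
    }

    /** A follower whose next entry has been purged cannot catch up from the log. */
    static boolean followerNeedsFullSnapshot(long followerNextIndex, long purgeIndex) {
        return followerNextIndex <= purgeIndex;
    }

    public static void main(String[] args) {
        long snapshotIndex = 1000;
        long followerMatchIndex = 400;          // slow follower has replicated up to 400
        long followerNextIndex = followerMatchIndex + 1;

        long purgedDefault = purgeIndex(snapshotIndex, followerMatchIndex, true, 0);
        long purgedDisabled = purgeIndex(snapshotIndex, followerMatchIndex, false, 0);
        long purgedPreserved = purgeIndex(snapshotIndex, followerMatchIndex, true, 700);

        System.out.println("purge enabled, no preservation: full snapshot needed = "
            + followerNeedsFullSnapshot(followerNextIndex, purgedDefault));
        System.out.println("purge disabled: full snapshot needed = "
            + followerNeedsFullSnapshot(followerNextIndex, purgedDisabled));
        System.out.println("purge enabled, 700 entries preserved: full snapshot needed = "
            + followerNeedsFullSnapshot(followerNextIndex, purgedPreserved));
    }
}
```

In this model the slow follower only has to fetch the full metadata store in the first case. The other two cases trade that for more log data kept on the leader's disk, which matches the disk-space risk raised in the conversation.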
We should add two configurations in OM to enable/disable Ratis purge parameters:
- raft.server.log.purge.upto.snapshot.index: true to preserve the current OM snapshot behavior
- raft.server.log.purge.preservation.log.num
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8131
How was this patch tested?
This should already be covered by the Ratis tests.
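Since the change only forwards two values to Ratis, a lightweight sanity check could confirm that both keys are set as expected. The sketch below uses java.util.Properties as a stand-in for the real Ratis property container; the class and method names are hypothetical, and only the two raft.server.log.purge.* key strings come from this PR.

```java
import java.util.Properties;

// Stand-in sketch: java.util.Properties replaces the real Ratis properties object.
public class PurgeConfigSketch {
    static final String PURGE_UPTO_SNAPSHOT_INDEX =
        "raft.server.log.purge.upto.snapshot.index";
    static final String PURGE_PRESERVATION_LOG_NUM =
        "raft.server.log.purge.preservation.log.num";

    /** Builds a property set carrying the two purge tuning values. */
    static Properties buildRaftProperties(boolean purgeUptoSnapshotIndex,
                                          long preservationLogNum) {
        Properties props = new Properties();
        props.setProperty(PURGE_UPTO_SNAPSHOT_INDEX,
            Boolean.toString(purgeUptoSnapshotIndex));
        props.setProperty(PURGE_PRESERVATION_LOG_NUM,
            Long.toString(preservationLogNum));
        return props;
    }

    public static void main(String[] args) {
        // Keep purging enabled (the current behavior) with an example
        // preservation window of 100000 entries; the number is illustrative.
        Properties props = buildRaftProperties(true, 100_000L);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```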