HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (#3453)
Conversation
Do you think we should still have a check here, and at least log if the pipeline does not have enough nodes and skip it, rather than loading it? I guess we should not get a "bad pipeline" here, as it never should be allowed to be added anyway.
I think it makes sense to leave these preconditions as they are. In any case, we are making sure to pass a correct pipeline with other checks.
As this change is required due to what we suspect is a bug in the RackScatterPlacementPolicy, where it returns fewer nodes than expected without throwing an exception, I think we should make a change to the RackScatter policy too, so that it throws an exception rather than returning successfully with fewer nodes than expected. You could add a reference to this Jira in a comment so we can remove it later if we get to the bottom of the suspected bug.
Oh yeah, I remember we talked about this. Let me add that.
Agree with Stephen. Maybe it would be better to check and handle this in that particular placement policy. If the placement policy cannot choose enough datanodes for the new pipeline, an exception should be thrown from it.
Moved the check into PipelineFactory#create
```java
if (nodesRequiredToChoose != chosenNodes.size()) {
  String reason = "Chosen nodes size: " + chosenNodes.size()
      + ", but required nodes to choose: " + nodesRequiredToChoose
      + " do not match.";
  LOG.warn("Placement policy could not choose the enough nodes."
      + " {} Available nodes count: {}, Excluded nodes count: {}",
      reason, totalNodesCount, excludedNodesCount);
  throw new SCMException(reason,
      SCMException.ResultCodes.FAILED_TO_FIND_HEALTHY_NODES);
}
```
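As a self-contained illustration of the check quoted above, the sketch below reproduces the fail-loudly pattern in isolation; the class name, exception type, and method are simplified placeholders, not the actual Ozone API:

```java
import java.util.List;

// Simplified illustration of the node-count check: the placement policy
// should throw instead of silently returning fewer nodes than requested.
// PlacementException stands in for SCMException; names are placeholders.
public class PlacementCheckSketch {
    static class PlacementException extends RuntimeException {
        PlacementException(String reason) {
            super(reason);
        }
    }

    // Fail loudly when the chosen node count does not match the request.
    static List<String> checkChosen(List<String> chosen, int required) {
        if (required != chosen.size()) {
            throw new PlacementException("Chosen nodes size: " + chosen.size()
                + ", but required nodes to choose: " + required
                + " do not match.");
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Only two nodes could be chosen although three were required.
        try {
            checkChosen(List.of("dn1", "dn2"), 3);
        } catch (PlacementException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With this pattern the caller gets an exception at pipeline-creation time instead of a short pipeline that only fails later.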
I wonder if validateContainerPlacement should verify the number of chosen nodes vs. the required nodes (either in the base class or in this specific subclass).
I assumed that is more about validating the placement with respect to racks. Somehow we are still seeing the issue of getting fewer nodes than requested.
Makes sense, I see the message is referencing racks.
Thanks @adoroszlai for the review
What changes were proposed in this pull request?
Moved the check before the pipeline is created. We should not have sanity checks after adding pipelines into the DB. If a pipeline fails such a post-update sanity check, SCM can crash, and it will never come back up because the same check fails again when the pipelines are loaded from the DB on restart.
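The failure mode described above can be sketched as follows. This is a minimal illustration of validating before persisting, not the actual SCM code: `addPipeline`, the in-memory map standing in for the DB, and the exception type are all simplified placeholders.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: validate a pipeline BEFORE persisting it. A check that runs
// after the DB write turns a bad pipeline into a poison record that
// crashes the process again on every reload. Names are hypothetical.
public class PipelineChecksSketch {
    // Stand-in for the pipeline DB.
    static Map<String, List<String>> db = new HashMap<>();

    static void addPipeline(String id, List<String> nodes, int required) {
        // Sanity check first: a failure here is a clean, recoverable error.
        if (nodes.size() != required) {
            throw new IllegalStateException("Chosen nodes: " + nodes.size()
                + ", required: " + required);
        }
        // Persist only after the check passes, so reload can never
        // encounter a pipeline that fails its own precondition.
        db.put(id, nodes);
    }

    public static void main(String[] args) {
        try {
            addPipeline("p1", List.of("dn1", "dn2"), 3);
        } catch (IllegalStateException e) {
            System.out.println("rejected before persist: " + e.getMessage());
        }
        // The undersized pipeline never reached the DB.
        System.out.println("db size=" + db.size());
    }
}
```

Because the check precedes the write, the DB can only ever contain pipelines that pass it, so restart/reload cannot hit the same failure.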
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-6795
How was this patch tested?
Verified with existing tests.