HDDS-10652. EC Reconstruction fails with "IOException: None of the block data have checksum" after upgrade#6520
Conversation
…ption: None of the block data have checksum"
| break; | ||
| } else { | ||
| ChunkInfo chunk = chunks.get(0); | ||
| LOG.info("The first chunk in block with index {} does not have stripeChecksum. BlockID: {}, Block size: {}." + |
There was a problem hiding this comment.
The checksum data is only stored in replica indexes 1 and the parity indexes. I think the BlockData[] array could hold any indexes, so it would be expected for some of them to not have it. Therefore this log might be a bit noisy and cause confusion.
I think it would be OK to make this debug, or just depend on the log you added below that warns if there was none found in any indexes.
There was a problem hiding this comment.
Yeah good point, I'll make it a debug log.
| } | ||
|
|
||
| BlockData checksumBlockData = null; | ||
| boolean foundStripeChecksum = false; |
There was a problem hiding this comment.
We can use checkSumBlockData == null to check if it was found, rather than adding a new boolean I think.
There was a problem hiding this comment.
I'm using this boolean to log only when stripeChecksum wasn't found, independent of whether checksumData was found or not. It's just to indicate that stripeChecksum wasn't found in this version, which can be helpful when debugging.
- Use the boolean to log when stripeChecksum wasn't found in any index.
- Instead of throwing, log at line 160 if checksumData was also not found (based on your comment below).
| if (!foundStripeChecksum) { | ||
| LOG.warn("Could not find stripeChecksum in any index for blockData with BlockID {}, length {} and " + | ||
| "blockGroupLength {}.", blockID, blockData.length, blockGroupLength); | ||
| } |
There was a problem hiding this comment.
Need to change the exception at line 160 to not throw if nothing has been found too. Infact, this log could just go into the else at line 160, rather than the throw.
|
@sodonnel Thanks for reviewing. Addressed your comments in the latest commit, yet to add testing. |
| .hasStripeChecksum()) { | ||
| checksumBlockData = bd; | ||
| break; | ||
| if (chunks != null && chunks.size() > 0) { |
There was a problem hiding this comment.
The original logic had:
if (chunks != null && chunks.size() > 0 && chunks.get(0)
.hasStripeChecksum()) {
So that checksumBlockData is only set if hasStripeChecksum()) returns true. With the change you have made, checksumBlockData will be set if hasChecksumData(), but it might not have stripeChecksum in it.
Then at line 155 it will enter the IF block and at line 174 I am not sure what will happen if stripeCheckSum is missing.
All the IF statement at 155 does is copy in the stripeChecksum if it exists. So if it does not exist, there is no point in going into that IF at all, as we will just be copying the original chunkChecksum out and back in again.
Following on from this - we only set checksumBlockData if there is a stripeChecksum, then you can also remove the foundStripeChecksum boolean as checksumBlockData != null means the same thing.
There was a problem hiding this comment.
Ah, so what you're saying is that there's no point in going inside the if block at line 155 if stripeChecksum isn't found because that code is only setting stripeChecksum. So overall, the only change we need here is to log instead of throw at line 184. Right?
| } | ||
| } | ||
|
|
||
| if (!foundStripeChecksum) { |
There was a problem hiding this comment.
I would move this log down to the else statement that is currently at line 184, as that else only executed if foundStripeChecksum is false (via chunkChecksumData == null).
|
I've added a unit test. The PR is ready for another round of reviews. |
|
Thanks @siddhantsangwan for the fix, @sodonnel for the review. |
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ne of the block data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703) Conflicts: hadoop-hdds/client/src/test/java/org/apache/hadoop/hdds/scm/storage/TestBlockOutputStreamCorrectness.java Change-Id: I97d0bcc631e1d0b70a1850a1e113623074918ef3
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ock data have checksum" after upgrade (apache#6520)
What changes were proposed in this pull request?
In this scenario, EC reconstruction is failing in the new Ozone version for EC data written in the old version because the
stripeChecksumfield in theChunkInfoproto breaks compatibility between the two versions.Please check the Jira for a description and analysis of the problem. In this pull request, I changed the code so that reconstruction doesn't fail if
stripeChecksumis missing. Also added some logging.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10652
How was this patch tested?
Added a unit test.