Limit possibility of file corruption#58
Conversation
|
|
||
| // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file. | ||
| // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete. | ||
| // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted. |
There was a problem hiding this comment.
I couldn't think of a way to eliminate this issue. If we truncate after writing the header, then we can still end up with file corruption because our reader doesn't actually know when the file should stop before it reaches maximumBytes. So if we had:
[header][length|long-message7][length|message5][length|message6]partial-remnant-of-message-3]
^ reader ^ maximumBytes
our nextEncodedMessageSpan() method would try to read the length of the the partial remnant of the message three, and we'd end up trying to decode garbage.
Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.
There was a problem hiding this comment.
I agree it's preferable to truncate before updating the offset in the file of the end of the newest message.
There was a problem hiding this comment.
Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.
This makes sense though I worry that there's a new chance of getting out of sync. Let's say we started with an empty file and write a message. We'd need to update both the offset of the end of the newest message, AND the offset of the last non-garbage byte in the file. We can't update both of those values atomically. In writing this comment I also realized (or re-realized) that none of our writes are guaranteed to be atomic, which is a scary thought.
There was a problem hiding this comment.
Yes the approach of keeping track of the last non-garbage byte in the file does lead to more chances that we get out of sync. However, we could:
- Write the
header.offsetOfEndOfFile. If we crash here we've just lost the data we were about to delete. We can accomplish this by making our reading code trustoffsetOfEndOfFilemore than it trustsupdateOffsetInFileAtEndOfNewestMessage(i.e.if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ }). - Write the
header.offsetInFileAtEndOfNewestMessage. If we crash here we've just lost the data we were about to delete.
Basically I think adding this data to the header is workable, plugs a known hole in our existing logic, and would not lead to more potential file corruption. That said I don't think plugging this hole is worth the additional effort. Right now a crash during a write won't lead to data loss, which feels like the right bar to hit.
There was a problem hiding this comment.
Yes I had considered your approach in (1) for this situation and concluded that it could work. I haven't thought deeply about whether if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ } is an accurate assumption in all cases.
| // This issue will not occur in a current version of the library, but an earlier version was capable of creating this corruption. | ||
| // This inconsistency is likely due to a crash occurring during a message write, between truncating the file and updating the header values to reflect that the file was truncated. | ||
| // This file is actually empty, despite what the header tells us. | ||
| throw CacheAdvanceError.fileCorrupted |
There was a problem hiding this comment.
Thought... If we have an empty read can we check if the reader offset is at the end of the header and check if the rest of the file is 0s? We believe that's the situation that caused the infinite recursion. If those conditions are met, we can confidently return [] as the messages, since the file has no data. If the rest of the file is not all 0s we can throw CacheAdvanceError.fileCorrupted.
What I like about this direction is that we are only throwing CacheAdvanceError.fileCorrupted for cases where we enter situation that is truly unexplainable. In this situation though the file does have enough information for us to know what state we are in... it's just unfortunate that the information is not completely in the header.
If we had more time we could consider if it's possible to rethink our header design to eliminate this edge case where we need to check if the file has all 0s.
There was a problem hiding this comment.
I think this would be a nice improvement over what we have, but since it's not that trivial to write and we're talking about a very, very rare case I'm not sure it's worth the effort. I do agree though that throwing fileCorrupted when we can validate that in fact we really have no messages isn't great. But I think it's also reasonable (though not ideal) to say that the file is corrupted when we know for certain that the header and the file contents are out of sync. And since we are expecting that this case is due to the file being unexpectedly empty, we expect that throwing here doesn't translate to lost data.
There was a problem hiding this comment.
Per discussion elsewhere on this PR I am no longer convinced this comment is correct.
|
|
||
| // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file. | ||
| // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete. | ||
| // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted. |
There was a problem hiding this comment.
I agree it's preferable to truncate before updating the offset in the file of the end of the newest message.
|
|
||
| // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file. | ||
| // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete. | ||
| // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted. |
There was a problem hiding this comment.
Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.
This makes sense though I worry that there's a new chance of getting out of sync. Let's say we started with an empty file and write a message. We'd need to update both the offset of the end of the newest message, AND the offset of the last non-garbage byte in the file. We can't update both of those values atomically. In writing this comment I also realized (or re-realized) that none of our writes are guaranteed to be atomic, which is a scary thought.
bachand
left a comment
There was a problem hiding this comment.
I had another comment I'm chewing on 🤔
|
|
||
| // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file. | ||
| // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete. | ||
| // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted. |
There was a problem hiding this comment.
Yes I had considered your approach in (1) for this situation and concluded that it could work. I haven't thought deeply about whether if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ } is an accurate assumption in all cases.
| // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file. | ||
| // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete. | ||
| // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted. | ||
| try header.updateOffsetInFileAtEndOfNewestMessage(to: writer.offsetInFile) |
There was a problem hiding this comment.
I want to dig into this a bit more... Isn't the writer always at the end of the newest message? Under what circumstances would this line of code have an effect?
There was a problem hiding this comment.
good point! this line is indeed duplicative.
so... now I'm really getting lost on how our file got into the corrupted state in #52. It's almost as if we wrote an empty message to the file... but that shouldn't be possible?!
| // If the application crashes between writing the header and writing the message data, we'll have lost the messages between the previous offsetInFileOfOldestMessage and the new offsetInFileOfOldestMessage. | ||
| try header.updateOffsetInFileOfOldestMessage(to: offsetInFileOfOldestMessage) | ||
|
|
||
| // Truncate the file if it needs truncation before we write the next message, and after we update our header. |
There was a problem hiding this comment.
Thinking about this more I'm not sure there is a difference from a data safety POV in truncating the file before or after we update the offset of the start of the oldest message:
- If the offset of the start of the oldest message is before the offset of the end of the newest message there's no way we can ever read into the bytes after the end of the newest message.
- If the offset of the start of the oldest message and the offset of the end of the newest message are both at the end of the header we will never truncate (see Prevent infinite recursion loop while reading messages #52 (comment)).
- We only need to consider the case where the offset of the start of the oldest message is after the offset of the end of the newest message.
For (3) if we truncate immediately (as we are changing this code to do in this PR), we could crash after truncation and before we update the offset of the start of the oldest message. In this situation the offset of the start of the oldest message is in the truncation region. When the file was read again we'd have an empty read and go back to the beginning of the file.
For (3) if we truncate after updating the offset of the start of the oldest message we could crash before truncating. In this situation we'd have non-0 bytes at the end of the file. But the reader will never get to these bytes since they are after the offset of the end of the newest message, like in the first situation.
@dfed do you agree with this reasoning? I'm OK with truncating earlier but I wanted to make sure I understand.
There was a problem hiding this comment.
For (3) if we truncate immediately (as we are changing this code to do in this PR), we could crash after truncation and before we update the offset of the start of the oldest message. In this situation the offset of the start of the oldest message is in the truncation region. When the file was read again we'd have an empty read and go back to the beginning of the file.
This is my understanding.
For (3) if we truncate after updating the offset of the start of the oldest message we could crash before truncating. In this situation we'd have non-0 bytes at the end of the file. But the reader will never get to these bytes since they are after the offset of the end of the newest message, like in the first situation.
Agreed.
do you agree with this reasoning?
Yes. And I'm coming to realize that I don't think that #52 or this PR actually reduced (or increased) the likelihood of a file corruption. #52 did help handle the file corruption in a better wya than we were doing before. But I no longer think I understand how we ended up with the file corruption, which isn't great.
@bachand's comment led me to realize that the fix in #52 didn't actually prevent file corruption from happening. Moreover, it is not possible to prevent file corruption entirely!
This PR attempts to minimize the likelihood that we will encounter two
emptyReads in a row and throw a file corruption error here:CacheAdvance/Sources/CacheAdvance/CacheReader.swift
Lines 63 to 69 in ed9ece4
To repeat some of what I wrote in response to Michael's review, we can encounter two
emptyReads in a row when the:offsetInFileAtEndOfNewestMessageis beyond the actual end of the fileWe know that in v1.2.0, it is possible to encounter two
emptyReads in a row because we saw it happen within the Airbnb app.We know that the only way for
offsetInFileAtEndOfNewestMessageto be beyond the actual end of the file is if we deleted part of the file before updating the header with our next write. This means that my root cause fix in #52 wasn't correct. In #52 I attempted to delete data from the file after updating the header'supdateOffsetInFileOfOldestMessage, when in fact the issue is that we need to delete this data after updating the header'soffsetInFileAtEndOfNewestMessage.While I was here I fixed an indentation issue. I recommend reviewing this PR commit-by-commit.