Skip to content

Limit possibility of file corruption#58

Merged
dfed merged 3 commits into
mainfrom
dfed--better-file-corruption-fix
Mar 11, 2022
Merged

Limit possibility of file corruption#58
dfed merged 3 commits into
mainfrom
dfed--better-file-corruption-fix

Conversation

@dfed

@dfed dfed commented Mar 10, 2022

Copy link
Copy Markdown
Owner

@bachand's comment led me to realize that the fix in #52 didn't actually prevent file corruption from happening. Moreover, it is not possible to prevent file corruption entirely!

This PR attempts to minimize the likelihood that we will encounter two emptyReads in a row and throw a file corruption error here:

guard !previousReadWasEmpty else {
// If the previous read was also empty, then the file has been corrupted.
// Two empty reads in a row means that offsetInFileAtEndOfNewestMessage is incorrect.
// This inconsistency is likely due to a crash occurring during a message write.
// This issue will not occur in a current version of the library, but an earlier version was capable of creating this corruption.
throw CacheAdvanceError.fileCorrupted
}

To repeat some of what I wrote in response to Michael's review, we can encounter two emptyReads in a row when the:

  1. offsetInFileAtEndOfNewestMessage is beyond the actual end of the file
  2. The file is empty except for the header

We know that in v1.2.0, it is possible to encounter two emptyReads in a row because we saw it happen within the Airbnb app.

We know that the only way for offsetInFileAtEndOfNewestMessage to be beyond the actual end of the file is if we deleted part of the file before updating the header with our next write. This means that my root cause fix in #52 wasn't correct. In #52 I attempted to delete data from the file after updating the header's updateOffsetInFileOfOldestMessage, when in fact the issue is that we need to delete this data after updating the header's offsetInFileAtEndOfNewestMessage.

While I was here I fixed an indentation issue. I recommend reviewing this PR commit-by-commit.


// Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
// If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
// If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.

@dfed dfed Mar 10, 2022

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't think of a way to eliminate this issue. If we truncate after writing the header, then we can still end up with file corruption because our reader doesn't actually know when the file should stop before it reaches maximumBytes. So if we had:

[header][length|long-message7][length|message5][length|message6]partial-remnant-of-message-3]
                                                                ^ reader                      ^ maximumBytes

our nextEncodedMessageSpan() method would try to read the length of the the partial remnant of the message three, and we'd end up trying to decode garbage.

Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's preferable to truncate before updating the offset in the file of the end of the newest message.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.

This makes sense though I worry that there's a new chance of getting out of sync. Let's say we started with an empty file and write a message. We'd need to update both the offset of the end of the newest message, AND the offset of the last non-garbage byte in the file. We can't update both of those values atomically. In writing this comment I also realized (or re-realized) that none of our writes are guaranteed to be atomic, which is a scary thought.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the approach of keeping track of the last non-garbage byte in the file does lead to more chances that we get out of sync. However, we could:

  1. Write the header.offsetOfEndOfFile. If we crash here we've just lost the data we were about to delete. We can accomplish this by making our reading code trust offsetOfEndOfFile more than it trusts updateOffsetInFileAtEndOfNewestMessage (i.e. if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ }).
  2. Write the header.offsetInFileAtEndOfNewestMessage. If we crash here we've just lost the data we were about to delete.

Basically I think adding this data to the header is workable, plugs a known hole in our existing logic, and would not lead to more potential file corruption. That said I don't think plugging this hole is worth the additional effort. Right now a crash during a write won't lead to data loss, which feels like the right bar to hit.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I had considered your approach in (1) for this situation and concluded that it could work. I haven't thought deeply about whether if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ } is an accurate assumption in all cases.

@dfed dfed requested a review from bachand March 10, 2022 00:14
@dfed dfed marked this pull request as ready for review March 10, 2022 00:14
// This issue will not occur in a current version of the library, but an earlier version was capable of creating this corruption.
// This inconsistency is likely due to a crash occurring during a message write, between truncating the file and updating the header values to reflect that the file was truncated.
// This file is actually empty, despite what the header tells us.
throw CacheAdvanceError.fileCorrupted

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought... If we have an empty read can we check if the reader offset is at the end of the header and check if the rest of the file is 0s? We believe that's the situation that caused the infinite recursion. If those conditions are met, we can confidently return [] as the messages, since the file has no data. If the rest of the file is not all 0s we can throw CacheAdvanceError.fileCorrupted.

What I like about this direction is that we are only throwing CacheAdvanceError.fileCorrupted for cases where we enter situation that is truly unexplainable. In this situation though the file does have enough information for us to know what state we are in... it's just unfortunate that the information is not completely in the header.

If we had more time we could consider if it's possible to rethink our header design to eliminate this edge case where we need to check if the file has all 0s.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a nice improvement over what we have, but since it's not that trivial to write and we're talking about a very, very rare case I'm not sure it's worth the effort. I do agree though that throwing fileCorrupted when we can validate that in fact we really have no messages isn't great. But I think it's also reasonable (though not ideal) to say that the file is corrupted when we know for certain that the header and the file contents are out of sync. And since we are expecting that this case is due to the file being unexpectedly empty, we expect that throwing here doesn't translate to lost data.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per discussion elsewhere on this PR I am no longer convinced this comment is correct.


// Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
// If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
// If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's preferable to truncate before updating the offset in the file of the end of the newest message.


// Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
// If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
// If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.

This makes sense though I worry that there's a new chance of getting out of sync. Let's say we started with an empty file and write a message. We'd need to update both the offset of the end of the newest message, AND the offset of the last non-garbage byte in the file. We can't update both of those values atomically. In writing this comment I also realized (or re-realized) that none of our writes are guaranteed to be atomic, which is a scary thought.

@dfed dfed merged commit 6bb2434 into main Mar 11, 2022

@bachand bachand left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another comment I'm chewing on 🤔


// Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
// If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
// If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I had considered your approach in (1) for this situation and concluded that it could work. I haven't thought deeply about whether if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ } is an accurate assumption in all cases.

// Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
// If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
// If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.
try header.updateOffsetInFileAtEndOfNewestMessage(to: writer.offsetInFile)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to dig into this a bit more... Isn't the writer always at the end of the newest message? Under what circumstances would this line of code have an effect?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! this line is indeed duplicative.

so... now I'm really getting lost on how our file got into the corrupted state in #52. It's almost as if we wrote an empty message to the file... but that shouldn't be possible?!

// If the application crashes between writing the header and writing the message data, we'll have lost the messages between the previous offsetInFileOfOldestMessage and the new offsetInFileOfOldestMessage.
try header.updateOffsetInFileOfOldestMessage(to: offsetInFileOfOldestMessage)

// Truncate the file if it needs truncation before we write the next message, and after we update our header.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more I'm not sure there is a difference from a data safety POV in truncating the file before or after we update the offset of the start of the oldest message:

  1. If the offset of the start of the oldest message is before the offset of the end of the newest message there's no way we can ever read into the bytes after the end of the newest message.
  2. If the offset of the start of the oldest message and the offset of the end of the newest message are both at the end of the header we will never truncate (see Prevent infinite recursion loop while reading messages #52 (comment)).
  3. We only need to consider the case where the offset of the start of the oldest message is after the offset of the end of the newest message.

For (3) if we truncate immediately (as we are changing this code to do in this PR), we could crash after truncation and before we update the offset of the start of the oldest message. In this situation the offset of the start of the oldest message is in the truncation region. When the file was read again we'd have an empty read and go back to the beginning of the file.

For (3) if we truncate after updating the offset of the start of the oldest message we could crash before truncating. In this situation we'd have non-0 bytes at the end of the file. But the reader will never get to these bytes since they are after the offset of the end of the newest message, like in the first situation.

@dfed do you agree with this reasoning? I'm OK with truncating earlier but I wanted to make sure I understand.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For (3) if we truncate immediately (as we are changing this code to do in this PR), we could crash after truncation and before we update the offset of the start of the oldest message. In this situation the offset of the start of the oldest message is in the truncation region. When the file was read again we'd have an empty read and go back to the beginning of the file.

This is my understanding.

For (3) if we truncate after updating the offset of the start of the oldest message we could crash before truncating. In this situation we'd have non-0 bytes at the end of the file. But the reader will never get to these bytes since they are after the offset of the end of the newest message, like in the first situation.

Agreed.

do you agree with this reasoning?

Yes. And I'm coming to realize that I don't think that #52 or this PR actually reduced (or increased) the likelihood of a file corruption. #52 did help handle the file corruption in a better wya than we were doing before. But I no longer think I understand how we ended up with the file corruption, which isn't great.

@dfed dfed mentioned this pull request Mar 11, 2022
dfed added a commit that referenced this pull request Mar 15, 2022
* Delete duplicative comment + code

* Remove incorrect assertion in comment

* Add test for file header length
@dfed dfed deleted the dfed--better-file-corruption-fix branch September 2, 2022 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants