Limit possibility of file corruption by dfed · Pull Request #58 · dfed/CacheAdvance

dfed · 2022-03-10T00:07:14Z

@bachand's comment led me to realize that the fix in #52 didn't actually prevent file corruption from happening. Moreover, it is not possible to prevent file corruption entirely!

This PR attempts to minimize the likelihood that we will encounter two emptyReads in a row and throw a file corruption error here:

CacheAdvance/Sources/CacheAdvance/CacheReader.swift

Lines 63 to 69 in ed9ece4

    
           guard !previousReadWasEmpty else { 
        
               // If the previous read was also empty, then the file has been corrupted. 
        
               // Two empty reads in a row means that offsetInFileAtEndOfNewestMessage is incorrect. 
        
               // This inconsistency is likely due to a crash occurring during a message write. 
        
               // This issue will not occur in a current version of the library, but an earlier version was capable of creating this corruption. 
        
               throw CacheAdvanceError.fileCorrupted 
        
           }

To repeat some of what I wrote in response to Michael's review, we can encounter two emptyReads in a row when the:

offsetInFileAtEndOfNewestMessage is beyond the actual end of the file
The file is empty except for the header

We know that in v1.2.0, it is possible to encounter two emptyReads in a row because we saw it happen within the Airbnb app.

We know that the only way for offsetInFileAtEndOfNewestMessage to be beyond the actual end of the file is if we deleted part of the file before updating the header with our next write. This means that my root cause fix in #52 wasn't correct. In #52 I attempted to delete data from the file after updating the header's updateOffsetInFileOfOldestMessage, when in fact the issue is that we need to delete this data after updating the header's offsetInFileAtEndOfNewestMessage.

While I was here I fixed an indentation issue. I recommend reviewing this PR commit-by-commit.

dfed · 2022-03-10T00:13:24Z

+
+                // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
+                // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
+                // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.


I couldn't think of a way to eliminate this issue. If we truncate after writing the header, then we can still end up with file corruption because our reader doesn't actually know when the file should stop before it reaches maximumBytes. So if we had:

[header][length|long-message7][length|message5][length|message6]partial-remnant-of-message-3] ^ reader ^ maximumBytes

our nextEncodedMessageSpan() method would try to read the length of the the partial remnant of the message three, and we'd end up trying to decode garbage.

Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.

I agree it's preferable to truncate before updating the offset in the file of the end of the newest message.

Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.

This makes sense though I worry that there's a new chance of getting out of sync. Let's say we started with an empty file and write a message. We'd need to update both the offset of the end of the newest message, AND the offset of the last non-garbage byte in the file. We can't update both of those values atomically. In writing this comment I also realized (or re-realized) that none of our writes are guaranteed to be atomic, which is a scary thought.

Yes the approach of keeping track of the last non-garbage byte in the file does lead to more chances that we get out of sync. However, we could:

Write the header.offsetOfEndOfFile. If we crash here we've just lost the data we were about to delete. We can accomplish this by making our reading code trust offsetOfEndOfFile more than it trusts updateOffsetInFileAtEndOfNewestMessage (i.e. if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ }).

Write the header.offsetInFileAtEndOfNewestMessage. If we crash here we've just lost the data we were about to delete.

Basically I think adding this data to the header is workable, plugs a known hole in our existing logic, and would not lead to more potential file corruption. That said I don't think plugging this hole is worth the additional effort. Right now a crash during a write won't lead to data loss, which feels like the right bar to hit.

Yes I had considered your approach in (1) for this situation and concluded that it could work. I haven't thought deeply about whether if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ } is an accurate assumption in all cases.

bachand · 2022-03-10T23:50:22Z

-                // This issue will not occur in a current version of the library, but an earlier version was capable of creating this corruption.
+                // This inconsistency is likely due to a crash occurring during a message write, between truncating the file and updating the header values to reflect that the file was truncated.
+                // This file is actually empty, despite what the header tells us.
                throw CacheAdvanceError.fileCorrupted


Thought... If we have an empty read can we check if the reader offset is at the end of the header and check if the rest of the file is 0s? We believe that's the situation that caused the infinite recursion. If those conditions are met, we can confidently return [] as the messages, since the file has no data. If the rest of the file is not all 0s we can throw CacheAdvanceError.fileCorrupted.

What I like about this direction is that we are only throwing CacheAdvanceError.fileCorrupted for cases where we enter situation that is truly unexplainable. In this situation though the file does have enough information for us to know what state we are in... it's just unfortunate that the information is not completely in the header.

If we had more time we could consider if it's possible to rethink our header design to eliminate this edge case where we need to check if the file has all 0s.

I think this would be a nice improvement over what we have, but since it's not that trivial to write and we're talking about a very, very rare case I'm not sure it's worth the effort. I do agree though that throwing fileCorrupted when we can validate that in fact we really have no messages isn't great. But I think it's also reasonable (though not ideal) to say that the file is corrupted when we know for certain that the header and the file contents are out of sync. And since we are expecting that this case is due to the file being unexpectedly empty, we expect that throwing here doesn't translate to lost data.

Per discussion elsewhere on this PR I am no longer convinced this comment is correct.

bachand · 2022-03-10T23:51:36Z

+
+                // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
+                // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
+                // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.


I agree it's preferable to truncate before updating the offset in the file of the end of the newest message.

bachand · 2022-03-11T00:04:22Z

+
+                // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
+                // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
+                // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.


Ideally our header would keep track of the last non-garbage byte in the file and we wouldn't need to truncate the file, but that's a much larger lift.

This makes sense though I worry that there's a new chance of getting out of sync. Let's say we started with an empty file and write a message. We'd need to update both the offset of the end of the newest message, AND the offset of the last non-garbage byte in the file. We can't update both of those values atomically. In writing this comment I also realized (or re-realized) that none of our writes are guaranteed to be atomic, which is a scary thought.

bachand

I had another comment I'm chewing on 🤔

bachand · 2022-03-11T02:31:24Z

+
+                // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
+                // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
+                // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.


Yes I had considered your approach in (1) for this situation and concluded that it could work. I haven't thought deeply about whether if offsetOfEndOfFile < updateOffsetInFileAtEndOfNewestMessage { /* assume updateOffsetInFileAtEndOfNewestMessage is corrupt and trust offsetOfEndOfFile instead */ } is an accurate assumption in all cases.

bachand · 2022-03-11T03:06:57Z

+                // Update the offsetInFileAtEndOfNewestMessage in our header and reader such that we will ignore the now-deleted data in the file.
+                // If the application crashes between writing this header and writing the next message data, we'll have only lost the messages we were about to delete.
+                // If `writer.offsetInFile == FileHeader.expectedEndOfHeaderInFile`, then crashing before we update the header would lead to this now-empty file being viewed as corrupted.
+                try header.updateOffsetInFileAtEndOfNewestMessage(to: writer.offsetInFile)


I want to dig into this a bit more... Isn't the writer always at the end of the newest message? Under what circumstances would this line of code have an effect?

good point! this line is indeed duplicative.

so... now I'm really getting lost on how our file got into the corrupted state in #52. It's almost as if we wrote an empty message to the file... but that shouldn't be possible?!

bachand · 2022-03-11T03:50:46Z

            // If the application crashes between writing the header and writing the message data, we'll have lost the messages between the previous offsetInFileOfOldestMessage and the new offsetInFileOfOldestMessage.
            try header.updateOffsetInFileOfOldestMessage(to: offsetInFileOfOldestMessage)

-            // Truncate the file if it needs truncation before we write the next message, and after we update our header.


Thinking about this more I'm not sure there is a difference from a data safety POV in truncating the file before or after we update the offset of the start of the oldest message:

If the offset of the start of the oldest message is before the offset of the end of the newest message there's no way we can ever read into the bytes after the end of the newest message.

If the offset of the start of the oldest message and the offset of the end of the newest message are both at the end of the header we will never truncate (see Prevent infinite recursion loop while reading messages #52 (comment)).

We only need to consider the case where the offset of the start of the oldest message is after the offset of the end of the newest message.

For (3) if we truncate immediately (as we are changing this code to do in this PR), we could crash after truncation and before we update the offset of the start of the oldest message. In this situation the offset of the start of the oldest message is in the truncation region. When the file was read again we'd have an empty read and go back to the beginning of the file.

For (3) if we truncate after updating the offset of the start of the oldest message we could crash before truncating. In this situation we'd have non-0 bytes at the end of the file. But the reader will never get to these bytes since they are after the offset of the end of the newest message, like in the first situation.

@dfed do you agree with this reasoning? I'm OK with truncating earlier but I wanted to make sure I understand.

For (3) if we truncate immediately (as we are changing this code to do in this PR), we could crash after truncation and before we update the offset of the start of the oldest message. In this situation the offset of the start of the oldest message is in the truncation region. When the file was read again we'd have an empty read and go back to the beginning of the file.

This is my understanding.

For (3) if we truncate after updating the offset of the start of the oldest message we could crash before truncating. In this situation we'd have non-0 bytes at the end of the file. But the reader will never get to these bytes since they are after the offset of the end of the newest message, like in the first situation.

Agreed.

do you agree with this reasoning?

Yes. And I'm coming to realize that I don't think that #52 or this PR actually reduced (or increased) the likelihood of a file corruption. #52 did help handle the file corruption in a better wya than we were doing before. But I no longer think I understand how we ended up with the file corruption, which isn't great.

* Delete duplicative comment + code * Remove incorrect assertion in comment * Add test for file header length

dfed added 3 commits March 9, 2022 15:56

Fix indentation on test that catches a possible recursion loop

8fd1005

Minimize chance of file corruption

1561830

Bugfix version bump

a137e50

dfed commented Mar 10, 2022

View reviewed changes

dfed requested a review from bachand March 10, 2022 00:14

dfed marked this pull request as ready for review March 10, 2022 00:14

dfed mentioned this pull request Mar 10, 2022

Prevent infinite recursion loop while reading messages #52

Merged

bachand approved these changes Mar 11, 2022

View reviewed changes

dfed merged commit 6bb2434 into main Mar 11, 2022

bachand reviewed Mar 11, 2022

View reviewed changes

dfed mentioned this pull request Mar 11, 2022

Clean up after #52 and #58 #59

Merged

dfed added a commit that referenced this pull request Mar 15, 2022

Clean up after #52 and #58 (#59)

cc069c1

* Delete duplicative comment + code * Remove incorrect assertion in comment * Add test for file header length

dfed deleted the dfed--better-file-corruption-fix branch September 2, 2022 17:37

	guard !previousReadWasEmpty else {
	// If the previous read was also empty, then the file has been corrupted.
	// Two empty reads in a row means that offsetInFileAtEndOfNewestMessage is incorrect.
	// This inconsistency is likely due to a crash occurring during a message write.
	// This issue will not occur in a current version of the library, but an earlier version was capable of creating this corruption.
	throw CacheAdvanceError.fileCorrupted
	}

Uh oh!

Conversation

dfed commented Mar 10, 2022

Uh oh!

dfed Mar 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bachand left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dfed Mar 10, 2022 •

edited

Loading