Split into two ranges to read messages if offsetInFileOfOldestMessage > offsetInFileAtEndOfNewestMessage to avoid possible loop by jianjunwoo · Pull Request #72 · dfed/CacheAdvance

jianjunwoo · 2022-11-04T03:57:30Z

My concern is that this line of code is still there: try reader.seek(to: FileHeader.expectedEndOfHeaderInFile), which leaves open the possibility of running into an infinite loop.

So I have an idea for fixing this for both shouldOverwriteOldMessages true and false, without the seek-back action in this func.

1. When shouldOverwriteOldMessages is true and new_offset < old_offset, split the read into two ranges according to the old message offset and the new message offset.
1.1 Range 1 would be: from the old message offset to the end of the file.
1.2 Range 2 would be: from the top to the newest offset.
2. When shouldOverwriteOldMessages is false, the reader reads from top to end.

In both of the above situations, we never seek back for the next read, which guarantees there is no loop.

@dfed @bachand

dfed

I like how you're thinking about this problem! I need to dig in more before I can approve, which I won't have time to do until next week, but I'm excited by this direction 😄. In the meantime, I've left some comments/questions.

dfed · 2022-11-04T04:02:51Z

+        // There is only one range: | `offsetInFileOfOldestMessage` -> `offsetInFileAtEndOfNewestMessage`|
+        if reader.offsetInFileOfOldestMessage < reader.offsetInFileAtEndOfNewestMessage {
+            for encodedMessage in try reader.encodedMessagesFromOffset(reader.offsetInFileOfOldestMessage,
+                                                                       endOffset: reader.offsetInFileAtEndOfNewestMessage) {
+                messages.append(try decoder.decode(T.self, from: encodedMessage))
+            }
+        } else {
+            // In this case, the messages could be split to two ranges
+            // | First Range | (GAP: ignore) | Second Range |

+            // This is second range: | `offsetInFileOfOldestMessage` -> EOF |
+            for encodedMessage in try reader.encodedMessagesFromOffset(reader.offsetInFileOfOldestMessage) {
+                messages.append(try decoder.decode(T.self, from: encodedMessage))
+            }
+
+            // This is first range: | `expectedEndOfHeaderInFile` -> `offsetInFileAtEndOfNewestMessage`|
+            for encodedMessage in try reader.encodedMessagesFromOffset(
+                FileHeader.expectedEndOfHeaderInFile,
+                endOffset: reader.offsetInFileAtEndOfNewestMessage) {
+                messages.append(try decoder.decode(T.self, from: encodedMessage))
+            }
+        }


This is a clever approach! I don't have time to dig too deeply into this PR this week, but I'll do my best to get to it next week. My initial impression is that this approach seems reasonable. At a minimum, I really like the code comments you've written here – they'll really help future maintainers (myself included!).

I like how you're thinking about this problem! I need to dig in more before I can approve, which I won't have time to do until next week, but I'm excited by this direction 😄. In the meantime, I've left some comments/questions.

This approach came to me when I read the discussion in PR #64 again. Really happy to see this move forward if you feel it's right too : D @dfed

dfed · 2022-11-04T04:09:38Z

        XCTAssertEqual(messages, [])
    }

-    func test_messages_whenOffsetInFileAtEndOfNewestMessageIsBeyondEndOfNewestMessageButBeforeEndOfFile_throwsFileCorrupted() throws {


why is this test (and the deleted one below) no longer valid? I think testing these cases is still valuable, but I might be missing something.

These tests should be kept. They were deleted when I reverted the last commit.

dfed · 2022-11-04T04:10:17Z

-        let message: TestableMessage = "This is a test"
-        let maximumBytes = try requiredByteCount(for: [message])
+    func test_messages_throwsFileCorruptedWhenOffsetInFileAtEndOfNewsetMessageOutOfSync() throws {
+        let randomHighValue: UInt64 = 10_1000


let's make this number less confusing (and fix an old mistake of mine) by doing either:

Suggested change

-        let randomHighValue: UInt64 = 10_1000
+        let randomHighValue: UInt64 = 10_000

or:

Suggested change

-        let randomHighValue: UInt64 = 10_1000
+        let randomHighValue: UInt64 = 101_000
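
For context on why the original literal is confusing: Swift ignores underscores in integer literals, so `10_1000` is simply 101000. The short sketch below (the constant names are only illustrative) shows that the second suggestion keeps the value and fixes the grouping, while the first one changes the value.

```swift
// Underscores in Swift integer literals are purely visual separators.
let original: UInt64 = 10_1000
let regrouped: UInt64 = 101_000
let tenThousand: UInt64 = 10_000

// Both spellings denote the same value, 101000.
let sameValue = (original == regrouped)

// The first suggestion changes the value; the second only changes the grouping.
let firstSuggestionChangesValue = (original != tenThousand)
```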

codecov · 2022-11-04T05:15:08Z

Codecov Report

Merging #72 (27c9f65) into main (3a1ed6f) will increase coverage by 0.07%.
The diff coverage is 96.77%.

@@            Coverage Diff             @@
##             main      #72      +/-   ##
==========================================
+ Coverage   96.36%   96.44%   +0.07%     
==========================================
  Files          14       14              
  Lines         578      591      +13     
==========================================
+ Hits          557      570      +13     
  Misses         21       21

Impacted Files	Coverage Δ
Sources/CacheAdvance/CacheReader.swift	`93.50% <93.93%> (-0.75%)`	⬇️
Sources/CacheAdvance/CacheAdvance.swift	`100.00% <100.00%> (ø)`

jianjunwoo · 2022-11-04T09:01:18Z

+                break
+            }
+        }
+        if let endOffset, offsetInFile != endOffset {


If we use maximumOffset as the default value for the endOffset parameter of the func, this line needs to change to offsetInFile < endOffset.
With that change, we would also need to update some test cases, such as test_messages_whenOffsetInFileAtEndOfNewestMessageIsBeyondEndOfFile_throwsFileCorrupted. That case would no longer throw an error when offsetInFileAtEndOfNewestMessage is beyond EOF.

bachand · 2022-11-05T00:23:36Z

I will dig into this next week as well. Thanks @jianjunwoo !

bachand

Great work @jianjunwoo !

My understanding of this PR is as follows... We aren't currently aware of any type of file corruption that could lead to an infinite loop. At the same time, the way that code is currently written means that an infinite loop remains possible. This PR refactors the library so that an infinite loop is not possible.

It's worth pointing out that so far we haven't (as far as I can recall) acknowledged the two different "modes" of CacheAdvance in the code. The code has been written to transparently handle both situations. There's something elegant with not explicitly acknowledging each case and writing code that will work in either. At the same time, I have found that I spend a lot of time each time I review changes to CacheReader.swift thinking about the different possibilities. I feel like this change makes it easier to consider the edge cases since they are more explicitly captured in code.

I really like that we didn't need to change any tests in this PR. That makes me confident in the change.

Let's give @dfed the time to review as well to make sure he's on board before we move ahead 👍

bachand · 2022-11-07T22:26:35Z

-            messages.append(try decoder.decode(T.self, from: encodedMessage))
-        }
+        // There is only one range: | `offsetInFileOfOldestMessage` -> `offsetInFileAtEndOfNewestMessage`|
+        if reader.offsetInFileOfOldestMessage < reader.offsetInFileAtEndOfNewestMessage {


When the cache is empty I assume that reader.offsetInFileOfOldestMessage == reader.offsetInFileAtEndOfNewestMessage? I wonder if it would be best to explicitly handle the case of these two values being equal, to make this code easier to reason about.

It would also be nice to validate my assumption that reader.offsetInFileOfOldestMessage == reader.offsetInFileAtEndOfNewestMessage when the cache is empty.

When the cache is empty I assume that reader.offsetInFileOfOldestMessage == reader.offsetInFileAtEndOfNewestMessage

Per this comment block

CacheAdvance/Tests/CacheAdvanceTests/CacheAdvanceTests.swift

Lines 367 to 368 in 3a1ed6f

// up until the current position of the writing handle – which is at the end of the newest persisted message. This algorithm implies that if

// the reading handle and the writing handle are at the same position in the file, then the file is empty. Therefore, when writing a message

when the reading handle and the writing handle point at the same position, the file is empty. The reader starts out at header.offsetInFileOfOldestMessage, and the writer starts out at header.offsetInFileAtEndOfNewestMessage. And our reader.offsetInFileOfOldestMessage and reader.offsetInFileAtEndOfNewestMessage should always be set to the same values as those in the header, so your assumption is indeed true. I like the idea of explicitly handling this case.

I like this idea too. Handling == as empty will make it clearer.
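
The three cases discussed in this thread (empty, one contiguous range, two ranges after wrapping) can be sketched as a pure function over offsets. This is only an illustration, not the library's code: `rangesToRead` and its parameter names are hypothetical, and it assumes the offsets are already known to be valid (each lower bound is no greater than its upper bound).

```swift
/// Hypothetical helper, not part of CacheAdvance: returns the byte ranges a
/// reader would visit, in the order it would visit them.
/// Assumes valid offsets; building a Range with lower > upper traps at runtime.
func rangesToRead(
    oldestMessageOffset: UInt64,
    endOfNewestMessageOffset: UInt64,
    endOfHeaderOffset: UInt64,
    endOfFileOffset: UInt64
) -> [Range<UInt64>] {
    if oldestMessageOffset == endOfNewestMessageOffset {
        // Reader and writer are at the same position: the cache is empty.
        return []
    } else if oldestMessageOffset < endOfNewestMessageOffset {
        // One contiguous range from the oldest message to the end of the newest.
        return [oldestMessageOffset..<endOfNewestMessageOffset]
    } else {
        // The cache has wrapped: read the older messages up to the end of the
        // file first, then the newer ones that start right after the header.
        return [
            oldestMessageOffset..<endOfFileOffset,
            endOfHeaderOffset..<endOfNewestMessageOffset,
        ]
    }
}
```

Because every range is read forward exactly once and nothing seeks backwards, the total work is bounded by the size of the file.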

bachand · 2022-11-07T22:30:44Z

    ///
-    /// - Parameters:
-    ///   - file: The file URL indicating the desired location of the on-disk store. This file should already exist.
-    ///   - maximumBytes: The maximum size of the cache, in bytes.


bachand · 2022-11-07T22:37:55Z

+    }

+    /// Returns the next encodable message, seeking to the beginning of the next message.
+    func nextEncodedMessage() throws -> Data? {


Can we make this private now?

It could be private; let me do that in the next commit.
I kept it as is here to make the diff more readable.

bachand · 2022-11-07T22:51:53Z

-            }
-
-            // We know the next message is at the end of the file header. Let's seek to it.
-            try reader.seek(to: FileHeader.expectedEndOfHeaderInFile)


I agree that the library is safer against unknown unknown errors without this line of code.

dfed

I left a ton of little nit suggestions, but overall I really dig this approach. I appreciate your spending the time to make this library easier to maintain!

I have also validated that the lines not covered by tests in CacheReader.swift are already not covered on main.

dfed · 2022-11-08T01:24:26Z

+        try reader.seek(to: startOffset)
+        while let data = try nextEncodedMessage() {
+            encodedMessages.append(data)
+            if let endOffset = endOffset, offsetInFile >= endOffset {


When offsetInFile > endOffset, wouldn't that mean that the file is corrupted?

EDIT: yup! you covered this case a few lines down 😄

dfed · 2022-11-08T01:26:54Z

+    }

+    /// Returns the next encodable message, seeking to the beginning of the next message.
+    func nextEncodedMessage() throws -> Data? {


I really like that this method is no longer very smart. Our code is more declarative and less imperative now, and the new approach should be simpler to maintain 😄

jianjunwoo · 2022-11-08T07:16:13Z

    // MARK: Private

+    /// Returns the next encodable message, seeking to the beginning of the next message.
+    private func nextEncodedMessage() throws -> Data? {


Mark this func as private since it is not used by other files any more.

dfed · 2022-11-09T02:28:09Z

+                }
            }
        }
        if let endOffset = endOffset, offsetInFile != endOffset {


do we need this if anymore given the if within the while loop?

This makes no difference compared to the previous code. However, handling == and > separately looks more readable. @dfed

if offsetInFile >= endOffset { break }

Ahh. This code is catching the offsetInFile < endOffset case. Maybe we could make that explicit here? Far from necessary, but it could make the intent more clear if I'm reading this right.

Oooh, I misunderstood your question. The if within the while loop is needed to check offsetInFile == endOffset so that we stop at the end.

Ahh. This code is catching the offsetInFile < endOffset case. Maybe we could make that explicit here? Far from necessary, but it could make the intent more clear if I'm reading this right.

I agree it would be nice to make this more clear.

I'll take this in a follow-up PR
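
The distinction discussed in this thread (stop when the offset reaches the end, treat falling short or overshooting as corruption) can be sketched on an in-memory buffer. This is an illustrative sketch only, not the library's implementation: the one-byte length prefix, the `SketchError` type, and the `messages(in:from:to:)` function are all assumptions made for the example.

```swift
enum SketchError: Error, Equatable {
    case fileCorrupted
}

/// Reads length-prefixed messages from `bytes`, starting at `startOffset` and
/// expecting the last message to end exactly at `endOffset`.
/// Hypothetical format: one length byte followed by that many payload bytes.
/// Assumes `startOffset` is non-negative.
func messages(in bytes: [UInt8], from startOffset: Int, to endOffset: Int) throws -> [[UInt8]] {
    var offset = startOffset
    var result: [[UInt8]] = []
    // The offset only ever moves forward, so this loop always terminates.
    while offset < endOffset {
        // Undershoot: the data ran out before reaching `endOffset`.
        guard offset < bytes.count else { throw SketchError.fileCorrupted }
        let length = Int(bytes[offset])
        let payloadStart = offset + 1
        let payloadEnd = payloadStart + length
        // A zero length or a payload running past the buffer is malformed.
        guard length > 0, payloadEnd <= bytes.count else { throw SketchError.fileCorrupted }
        result.append(Array(bytes[payloadStart..<payloadEnd]))
        offset = payloadEnd
    }
    // Overshoot: the last message ended beyond `endOffset`.
    if offset > endOffset { throw SketchError.fileCorrupted }
    return result
}
```

Naming the undershoot and overshoot cases separately, rather than relying on a single `!=` check after the loop, is one way to make the intent visible to future readers.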

bachand

Nice!

bachand · 2022-11-09T03:42:52Z

+            }
+        } else if reader.offsetInFileOfOldestMessage == reader.offsetInFileAtEndOfNewestMessage {
+            // This is an empty cache.
+            return []


bachand · 2022-11-09T03:48:24Z

+                }
            }
        }
        if let endOffset = endOffset, offsetInFile != endOffset {


Ahh. This code is catching the offsetInFile < endOffset case. Maybe we could make that explicit here? Far from necessary, but it could make the intent more clear if I'm reading this right.

I agree it would be nice to make this more clear.

jianjunwoo · 2022-11-10T02:19:26Z

Looks like we're all set on the details, or are there any other comments that I missed?

dfed · 2022-11-10T02:37:56Z

That's it! Thanks again @jianjunwoo for tackling this + making future maintenance of this lib easier! Your contributions have immensely improved this library in recent weeks 🙏

jianjunwoo added 3 commits November 4, 2022 08:47

Revert "Protect against crashes during header write (dfed#66)"

b267dbd

This reverts commit 69881ef.

Split two ranges to read messages if offsetInFileOfOldestMessage > of…

e233c94

…fsetInFileAtEndOfNewestMessage to avoid possible loop

bump version

78f3cb5

dfed reviewed Nov 4, 2022

Comment thread CacheAdvance.podspec

jianjunwoo added 2 commits November 4, 2022 13:43

keep tests

b9c8ff7

add comment

f26e921

jianjunwoo force-pushed the public_dev branch from 2c17337 to f26e921 Compare November 4, 2022 07:09

add parameter comments

b3d64ec

jianjunwoo commented Nov 4, 2022

dfed reviewed Nov 4, 2022

Comment thread Sources/CacheAdvance/CacheReader.swift Outdated

support all the way back to Xcode 11

54e61e2

bachand approved these changes Nov 7, 2022

jianjunwoo changed the title ~~Split two ranges to read messages if offsetInFileOfOldestMessage > offsetInFileAtEndOfNewestMessage to avoid possible loop~~ Split into two ranges to read messages if offsetInFileOfOldestMessage > offsetInFileAtEndOfNewestMessage to avoid possible loop Nov 8, 2022

dfed approved these changes Nov 8, 2022

Apply suggestions from code review

f623856

Co-authored-by: Dan Federman <dfed@me.com>

jianjunwoo force-pushed the public_dev branch from 37f2e7e to f623856 Compare November 8, 2022 07:09

jianjunwoo added 2 commits November 8, 2022 15:12

Hanle cache file as empty when offsetInFileOfOldestMessage == offsetI…

46de4c5

…nFileAtEndOfNewestMessage

set func nextEncodedMessage private

57b3304

jianjunwoo commented Nov 8, 2022

handle offsetInFile > endOffset in func encodedMessagesFromOffset

27c9f65

dfed reviewed Nov 9, 2022

bachand approved these changes Nov 9, 2022

dfed merged commit 26a1c97 into dfed:main Nov 10, 2022
