KAFKA-16368: Update defaults for LOG_MESSAGE_TIMESTAMP_AFTER_MAX_MS_D… #18106
Conversation
Thank you for making this change @jayteej. Can you please add an entry similar to Line 25 in 77ac31b?
@divijvaidya done
I looked at the failing tests. The root cause is that the Streams-related tests spin up an EmbeddedCluster. This cluster initializes its time as MockTime, and MockTime doesn't advance unless you increment it manually. But the rest of the test uses actual time via Thread.sleep, so the cluster's time falls behind real time while the test is executing. They can be fixed with the following change:
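The suggested change itself isn't reproduced here, but the MockTime pitfall described above can be illustrated with a minimal stand-in (this is a sketch of the same idea, not Kafka's actual org.apache.kafka.common.utils.MockTime): the mock clock has no connection to wall-clock time, so sleeping with Thread.sleep leaves it untouched, while the mock clock's own sleep() advances it.

```java
// Minimal sketch of a mock clock in the style of Kafka's MockTime:
// it only advances when the test advances it explicitly.
public class MockTimeDemo {
    static class MockTime {
        private long nowMs;
        MockTime(long startMs) { this.nowMs = startMs; }
        long milliseconds() { return nowMs; }
        // Unlike Thread.sleep, this advances the mock clock itself.
        void sleep(long ms) { nowMs += ms; }
    }

    public static void main(String[] args) throws InterruptedException {
        MockTime clusterTime = new MockTime(0L);

        // The test sleeps on the real clock...
        Thread.sleep(50);
        // ...but the cluster's mock clock has not moved at all.
        System.out.println("after Thread.sleep: " + clusterTime.milliseconds());   // 0

        // The fix: advance the mock clock explicitly instead.
        clusterTime.sleep(50);
        System.out.println("after mockTime.sleep: " + clusterTime.milliseconds()); // 50
    }
}
```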
The failing ClientIdQuota tests have been failing in trunk as well - https://github.com/apache/kafka/actions/runs/12804428379/job/35699352905. Other tests are successful.
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (#18106) Reviewers: Divij Vaidya <diviv@amazon.com>
This PR did break two of our Kafka Streams system tests, and the changes you put into the KS integration tests were not really "correct". Cf. #18830. It's already fixed, so no action needed, but I think it would be ideal to ask for a review from somebody who knows the KS code base (the same applies to other components like Connect, etc.) when you change stuff in this part of the code. It took us many hours to identify the problem with the failing system test, which could have been avoided if we had been aware of this PR.
@mjsax I am sorry to hear that people had to spend multiple hours trying to fix this. I considered this change safe for KS without an expert review because:
I will look at why assumption 1 above was not correct in this scenario and improve the gap in my thought process. I will also explore how we can make it easier for non-experts to change KS code, since I think it would be beneficial for the community if we are able to scale without having to rely on experts for changes to a component. Separately, do you have any suggestions on how we could have reduced the time to mitigation from hours to minutes here? Where was most of the time spent?
No worries. Just highlighting this so we can (maybe?) do better going forward.
The problem was not the fix, but actually identifying the commit that introduced the system test failure. We basically had to do a "binary search" over the commit history, re-running the system test over and over until we found the commit. Once we had the commit, it was rather quick to understand the root cause.
Well, this PR was ok (even if not ideal) for the integration test, but we don't run system tests on regular PRs, so it was easy to miss. No mistake was made on this front IMHO.
Overall I agree with the sentiment, but based on past experience, the Kafka code base is very complex and it's difficult not to rely on experts IMHO. -- Would it have slowed down getting this PR merged significantly (especially since it's a very small change) if you had asked an expert to take a look? -- In the end it's always your personal judgment, I guess, and I don't think you did anything wrong. I can just share my personal approach: I would never merge any change touching stuff outside Kafka Streams without a "component expert" signing off on the non-KS changes (but maybe that's just me). To be frank, for this PR in particular, even if you had gotten a review from an expert, the issue might easily have been missed -- I would not claim that I would have seen the problem with the system test if I had reviewed this PR (on the contrary, I would say I would have approved this PR with 90+% probability without catching the system test issue...). However, if we had been aware of this PR and later saw the "system smoke test" that uses the same code failing, we could have connected the dots without the need to hunt for the commit that broke it. Just want to share my POV. Not sure if we can do better or not. Just want to trigger some discussion, in case somebody has a good idea.
As per KIP-1030, update the default value of num.recovery.threads.per.data.dir to 2 and the default value of message.timestamp.after.max.ms to 1 hour.
Updated system tests and documentation.
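For operators who want to keep the previous behavior after upgrading, both settings can be overridden in the broker configuration. A sketch of `server.properties` overrides (the pre-KIP-1030 values shown are my understanding of the old defaults -- verify against your broker version's documentation before relying on them):

```properties
# Revert to the pre-KIP-1030 defaults (assumed values; verify for your version).
# Old default was 1; KIP-1030 raises it to 2.
num.recovery.threads.per.data.dir=1
# Old default was effectively unbounded (Long.MAX_VALUE);
# KIP-1030 lowers it to 1 hour (3600000 ms).
message.timestamp.after.max.ms=9223372036854775807
```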