Conversation

@jayteej (Contributor) commented Dec 9, 2024

As per KIP-1030, update the default value of num.recovery.threads.per.data.dir to 2 and the default value of message.timestamp.after.max.ms to 1 hour.

Updated system tests and documentation.
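
For context, both defaults remain overridable by operators who need the previous behavior. Below is a minimal sketch of restoring the pre-KIP-1030 values; the class name is hypothetical, and note that the broker-level key for the timestamp bound carries the log. prefix (message.timestamp.after.max.ms is the topic-level name):

```java
import java.util.Properties;

public class PreKip1030Overrides {
    /** Broker property overrides that restore the defaults as they were before this PR. */
    public static Properties preKip1030Defaults() {
        Properties props = new Properties();
        // KIP-1030 raises this default from 1 to 2 threads per log directory.
        props.put("num.recovery.threads.per.data.dir", "1");
        // KIP-1030 lowers this default from unbounded (Long.MAX_VALUE) to 1 hour.
        props.put("log.message.timestamp.after.max.ms", String.valueOf(Long.MAX_VALUE));
        return props;
    }
}
```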

The github-actions bot added the "core" (Kafka Broker) and "small" (Small PRs) labels on Dec 9, 2024
@divijvaidya (Member) commented:

Thank you for making this change @jayteej. Can you please also add an entry about the changed default values, similar to the existing

```
<li>A number of deprecated classes, methods, configurations and tools have been removed.
```

You can test the documentation change by following https://cwiki.apache.org/confluence/display/KAFKA/Setup+Kafka+Website+on+Local+Apache+Server

@jayteej (Contributor, Author) commented Dec 10, 2024

@divijvaidya done

@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch 2 times, most recently from e853efd to 687ab8d on December 11, 2024 12:33
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 687ab8d to 8cd135b on December 11, 2024 15:51
The github-actions bot added the "storage" (Pull requests that target the storage module) label on Dec 11, 2024
@divijvaidya (Member) commented:

I looked at the failing tests. The root cause is that the Streams-related tests spin up an EmbeddedKafkaCluster, which initializes its time as MockTime. MockTime does not advance unless you increment it manually, but the rest of the test uses actual time via Thread.sleep. Hence the cluster's clock falls behind the wall clock while the test is executing.

They can be fixed with the following change:

```diff
-    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(NUM_BROKERS);
+    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(
+            NUM_BROKERS,
+            Utils.mkProperties(Collections.singletonMap("log.message.timestamp.after.max.ms", String.valueOf(Long.MAX_VALUE))));
```
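
To spell out the failure mode, here is a minimal standalone sketch (the class name and sleep duration are illustrative; it assumes org.apache.kafka.common.utils.MockTime from the clients test utilities):

```java
import org.apache.kafka.common.utils.MockTime;

public class MockTimeDriftSketch {
    public static void main(String[] args) throws InterruptedException {
        // MockTime starts at the current wall-clock time and then stays frozen
        // unless advanced explicitly.
        MockTime brokerTime = new MockTime();

        Thread.sleep(2_000); // the wall clock advances; the mock clock does not

        long recordTs = System.currentTimeMillis();           // producer-side wall-clock stamp
        long driftMs = recordTs - brokerTime.milliseconds();  // grows for the test's lifetime
        System.out.println("drift after 2s of real sleep: ~" + driftMs + " ms");

        brokerTime.sleep(2_000); // only an explicit call advances the mock clock

        // With the new 1-hour default for log.message.timestamp.after.max.ms
        // (previously unbounded), the broker rejects a record once its timestamp
        // is further ahead of the broker clock than the bound -- which a
        // long-running test mixing Thread.sleep with MockTime eventually hits.
    }
}
```

Overriding the config to Long.MAX_VALUE, as in the diff above, restores the unbounded pre-PR behavior for these tests.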

@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 8cd135b to 4aac68d on January 14, 2025 11:18
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch 2 times, most recently from 7fd8e85 to 179b445 on January 15, 2025 09:28
@divijvaidya added the "kip" (Requires or implements a KIP) label on Jan 15, 2025
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 179b445 to 7819e4d on January 16, 2025 10:19
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 7819e4d to 2a62970 on January 16, 2025 11:44
@divijvaidya (Member) commented:

The failing ClientIdQuotaTest has been failing in trunk as well - https://github.com/apache/kafka/actions/runs/12804428379/job/35699352905

Other tests are successful.

@divijvaidya merged commit 23c4592 into apache:trunk on Jan 16, 2025 (7 of 9 checks passed)
divijvaidya pushed a commit that referenced this pull request Jan 16, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
pranavt84 pushed a commit to pranavt84/kafka that referenced this pull request Jan 27, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
airlock-confluentinc bot pushed a commit to confluentinc/kafka that referenced this pull request Jan 27, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
@mjsax (Member) commented Feb 7, 2025

@jayteej @divijvaidya

This PR broke two of our Kafka Streams system tests, and the changes you put into the KS integration tests were not really "correct". Cf. #18830

It's already fixed, so no action is needed, but I think it would be ideal to ask for a review from somebody who knows the KS code base (the same applies to other components like Connect, etc.) when you change stuff in this part of the code. It took us many hours to identify the problem behind the failing system test, which could have been avoided if we had been aware of this PR.

@divijvaidya (Member) commented:

@mjsax I am sorry to hear that people had to spend multiple hours trying to fix this. I considered this change safe for KS without an expert review because:

  1. the change maintained the configuration that existed prior to this PR by explicitly setting log.message.timestamp.after.max.ms to String.valueOf(Long.MAX_VALUE). Hence, the assumption was that there would be no change at all from a test perspective.
  2. the tests were successful after the config value was set back to the pre-PR value.

I will look at why assumption 1 above did not hold in this scenario and close that gap in my thought process.

Also, I would like to explore how we can make it easier for non-experts to change KS code as well, since I think it would be beneficial for the community if we are able to scale without having to rely on experts for changes to a component.

Separately, do you have any suggestions on how we could have reduced the time to mitigation from hours to minutes here? Where was most of the time spent?

@mjsax (Member) commented Feb 9, 2025

No worries. Just highlighting this so we can (maybe?) do better going forward.

> I am sorry to hear that people had to spend multiple hours trying to fix this.

and

> Separately, do you have any suggestions on how we could have reduced the time to mitigation from hours to minutes here? Where was most of the time spent?

The problem was not the fix, but actually identifying the commit that introduced the system test failure. We basically had to do a "binary search" over the commit history, re-running the system test over and over until we found the commit. Once we had the commit, it was rather quick to understand the root cause.

> I will look at why assumption 1 above did not hold in this scenario and close that gap in my thought process.

Well, this PR was ok (even if not ideal) for the integration test, but we don't run system tests on regular PRs, so it was easy to miss. No mistake was made on this front IMHO.

> Also, I would like to explore how we can make it easier for non-experts to change KS code as well, since I think it would be beneficial for the community if we are able to scale without having to rely on experts for changes to a component.

Overall I agree with the sentiment, but based on past experience the Kafka code base is very complex, and it's difficult not to rely on experts IMHO. -- Would it have slowed down getting this PR merged significantly (especially since it's a very small change) if you had asked an expert to take a look? -- In the end it's always your personal judgment, I guess, and I don't think you did anything wrong. I can just share my personal approach: I would never merge code that changes stuff outside of Kafka Streams without a "component expert" signing off on the non-KS changes (but maybe that's just me).

To be frank: for this PR in particular, even with an expert review the issue might easily have been missed -- I would not want to claim that I would have caught the system test problem if I had reviewed this PR (on the contrary, I would say I would have approved it with 90+% probability without catching it). However, if we had been aware of this PR when we later saw the "system smoke test" exercising the same code fail, we could have connected the dots without having to hunt for the commit that broke it.

Just sharing my POV -- not sure if we can do better or not, but I want to trigger some discussion in case somebody has a good idea.

manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>