Conversation

@jayteej (Contributor) commented Dec 9, 2024

As per KIP-1030, update the default value of num.recovery.threads.per.data.dir to 2 and the default value of message.timestamp.after.max.ms to 1 hour.

Updated system tests and documentation.
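
For context, both defaults remain overridable by operators who need the previous behavior. Below is a minimal sketch of restoring the pre-KIP-1030 values; the class name is hypothetical, and note that the broker-level key for the timestamp bound carries the log. prefix (message.timestamp.after.max.ms is the topic-level name):

```java
import java.util.Properties;

public class PreKip1030Overrides {
    /** Broker property overrides that restore the defaults as they were before this PR. */
    public static Properties preKip1030Defaults() {
        Properties props = new Properties();
        // KIP-1030 raises this default from 1 to 2 threads per log directory.
        props.put("num.recovery.threads.per.data.dir", "1");
        // KIP-1030 lowers this default from unbounded (Long.MAX_VALUE) to 1 hour.
        props.put("log.message.timestamp.after.max.ms", String.valueOf(Long.MAX_VALUE));
        return props;
    }
}
```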

The github-actions bot added the "core" (Kafka Broker) and "small" (Small PRs) labels on Dec 9, 2024
@divijvaidya (Member) commented:

Thank you for making this change @jayteej. Can you please also add an entry about the changed default values, similar to the existing

```
<li>A number of deprecated classes, methods, configurations and tools have been removed.
```

You can test the documentation change by following https://cwiki.apache.org/confluence/display/KAFKA/Setup+Kafka+Website+on+Local+Apache+Server

@jayteej (Contributor, Author) commented Dec 10, 2024

@divijvaidya done

@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch 2 times, most recently from e853efd to 687ab8d on December 11, 2024 12:33
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 687ab8d to 8cd135b on December 11, 2024 15:51
The github-actions bot added the "storage" (Pull requests that target the storage module) label on Dec 11, 2024
@divijvaidya (Member) commented:

I looked at the failing tests. The root cause is that the Streams-related tests spin up an EmbeddedKafkaCluster, which initializes its time as MockTime. MockTime does not advance unless you increment it manually, but the rest of the test uses actual time via Thread.sleep. Hence the cluster's clock falls behind the wall clock while the test is executing.

They can be fixed with the following change:

```diff
-    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(NUM_BROKERS);
+    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(
+            NUM_BROKERS,
+            Utils.mkProperties(Collections.singletonMap("log.message.timestamp.after.max.ms", String.valueOf(Long.MAX_VALUE))));
```
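
To spell out the failure mode, here is a minimal standalone sketch (the class name and sleep duration are illustrative; it assumes org.apache.kafka.common.utils.MockTime from the clients test utilities):

```java
import org.apache.kafka.common.utils.MockTime;

public class MockTimeDriftSketch {
    public static void main(String[] args) throws InterruptedException {
        // MockTime starts at the current wall-clock time and then stays frozen
        // unless advanced explicitly.
        MockTime brokerTime = new MockTime();

        Thread.sleep(2_000); // the wall clock advances; the mock clock does not

        long recordTs = System.currentTimeMillis();           // producer-side wall-clock stamp
        long driftMs = recordTs - brokerTime.milliseconds();  // grows for the test's lifetime
        System.out.println("drift after 2s of real sleep: ~" + driftMs + " ms");

        brokerTime.sleep(2_000); // only an explicit call advances the mock clock

        // With the new 1-hour default for log.message.timestamp.after.max.ms
        // (previously unbounded), the broker rejects a record once its timestamp
        // is further ahead of the broker clock than the bound -- which a
        // long-running test mixing Thread.sleep with MockTime eventually hits.
    }
}
```

Overriding the config to Long.MAX_VALUE, as in the diff above, restores the unbounded pre-PR behavior for these tests.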

@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 8cd135b to 4aac68d on January 14, 2025 11:18
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch 2 times, most recently from 7fd8e85 to 179b445 on January 15, 2025 09:28
@divijvaidya added the "kip" (Requires or implements a KIP) label on Jan 15, 2025
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 179b445 to 7819e4d on January 16, 2025 10:19
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG
@jayteej force-pushed the KAFKA-16368-LOG_TS_AND_RECOVERY_THREADS branch from 7819e4d to 2a62970 on January 16, 2025 11:44
@divijvaidya (Member) commented:

The failing ClientIdQuotaTest has been failing in trunk as well - https://github.com/apache/kafka/actions/runs/12804428379/job/35699352905

Other tests are successful.

@divijvaidya merged commit 23c4592 into apache:trunk on Jan 16, 2025 (7 of 9 checks passed)
divijvaidya pushed a commit that referenced this pull request Jan 16, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
pranavt84 pushed a commit to pranavt84/kafka that referenced this pull request Jan 27, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
airlock-confluentinc bot pushed a commit to confluentinc/kafka that referenced this pull request Jan 27, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
@mjsax (Member) commented Feb 7, 2025

@jayteej @divijvaidya

This PR broke two of our Kafka Streams system tests, and the changes you put into the KS integration tests were not really "correct". Cf. #18830

It's already fixed, so no action is needed, but I think it would be ideal to ask for a review from somebody who knows the KS code base (the same applies to other components like Connect, etc.) when you change stuff in this part of the code. It took us many hours to identify the problem behind the failing system test, which could have been avoided if we had been aware of this PR.

@divijvaidya (Member) commented:

@mjsax I am sorry to hear that people had to spend multiple hours trying to fix this. I considered this change safe for KS without an expert review because:

  1. the change maintained the configuration that existed prior to this PR by explicitly setting log.message.timestamp.after.max.ms to String.valueOf(Long.MAX_VALUE). Hence, the assumption was that there would be no change at all from a test perspective.
  2. the tests were successful after the config value was set back to the pre-PR value.

I will look at why assumption 1 above did not hold in this scenario and close that gap in my thought process.

Also, I would like to explore how we can make it easier for non-experts to change KS code as well, since I think it would be beneficial for the community if we are able to scale without having to rely on experts for changes to a component.

Separately, do you have any suggestions on how we could have reduced the time to mitigation from hours to minutes here? Where was most of the time spent?

@mjsax (Member) commented Feb 9, 2025

No worries. Just highlighting this so we can (maybe?) do better going forward.

> I am sorry to hear that people had to spend multiple hours trying to fix this.

and

> Separately, do you have any suggestions on how we could have reduced the time to mitigation from hours to minutes here? Where was most of the time spent?

The problem was not the fix, but actually identifying the commit that introduced the system test failure. We basically had to do a "binary search" over the commit history, re-running the system test over and over until we found the commit. Once we had the commit, it was rather quick to understand the root cause.

> I will look at why assumption 1 above did not hold in this scenario and close that gap in my thought process.

Well, this PR was ok (even if not ideal) for the integration test, but we don't run system tests on regular PRs, so it was easy to miss. No mistake was made on this front IMHO.

> Also, I would like to explore how we can make it easier for non-experts to change KS code as well, since I think it would be beneficial for the community if we are able to scale without having to rely on experts for changes to a component.

Overall I agree with the sentiment, but based on past experience the Kafka code base is very complex, and it's difficult not to rely on experts IMHO. -- Would it have slowed down getting this PR merged significantly (especially since it's a very small change) if you had asked an expert to take a look? -- In the end it's always your personal judgment, I guess, and I don't think you did anything wrong. I can just share my personal approach: I would never merge code that changes stuff outside of Kafka Streams without a "component expert" signing off on the non-KS changes (but maybe that's just me).

To be frank: for this PR in particular, even with an expert review the issue might easily have been missed -- I would not want to claim that I would have caught the system test problem if I had reviewed this PR (on the contrary, I would say I would have approved it with 90+% probability without catching it). However, if we had been aware of this PR when we later saw the "system smoke test" exercising the same code fail, we could have connected the dots without having to hunt for the commit that broke it.

Just sharing my POV -- not sure if we can do better or not, but I want to trigger some discussion in case somebody has a good idea.

manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>