Conversation
Restored highly available service in order to support multiple managers. This change should also avoid clients that connect to a non-primary manager from getting stuck. Made all thrift methods throw ThriftNotActiveServiceException, if the method does not declare this it will cause a server side exception. There was a single one way thrift method that could not throw this, so it would still cause server side exceptions if its called. Delaying advertising should prevent this in most cases.
|
Now that the compaction coordinator can throw ThriftNotActiveServiceException the compactors should retry if they see this exception. Looked at the compactors and it seems like they will retry as all of their calls use RetryableThriftCall. |
| * lock. | ||
| * | ||
| * <p> | ||
| * Its expected that all methods in the wrapped thrift service declare they throw |
There was a problem hiding this comment.
All these Highly* classes were copied as is from an earlier commit. The only change I made was adding this comment.
| compactionCoordinator.getThriftService(), managerClientHandler, getContext()); | ||
| wrappedCoordinator, managerClientHandler, getContext()); | ||
| try { | ||
| updateThriftServer(() -> { |
There was a problem hiding this comment.
The Manager is the only process that passes false for the last argument so that it can delay starting the ThriftServer until after the Manager is fully up. I think with these changes that last parameter can be removed from the method.
There was a problem hiding this comment.
I can look into removing that.
| }, false); | ||
| // Now that the Manager is up, start the ThriftServer | ||
| Objects.requireNonNull(getThriftServerAddress(), "Thrift Server Address should not be null"); | ||
| getThriftServerAddress().startThriftServer("Manager Client Service Handler"); |
There was a problem hiding this comment.
If the suggestion above is implemented, then this line can be removed.
|
I wonder if there is a way to automate a check that all Thrift RPC methods handled by the Manager throws the ThriftNotActiveServiceException |
Not sure this functionality has any tests. There is the server side aspect of throwing the exception, if we could automate test for that it would be nice. Then there is the client side code that retries when it sees this exception, that code is sprinkled all over the place. The server side code seems easier to test. For the client side code, maybe the best way to test is to run random walk tests while starting/stopping managers. |
|
For the client side code, maybe if we refactored to use more common code on the client side then we could focus on testing that common code. I noticed the compactor code did not need to change because it uses common code for retry. That common code retries on TException though which may be too broad for the general case, like we probably should not retry on a thrift security exception. |
|
I looked at trying to write a test for this, but could not figure it out quickly. Another option would be to fail at runtime on the server side, which should be caught during testing. To do this, we could include the following method in the ThriftProcessorTypes class, then call it for each interface passed to the getManagerTProcessor method. |
|
Re using reflection to look for absence of the exception, thrift one way methods can not throw an exception. Not sure if we can detect if a method is one way. If we could that would be nice in the HAService code, it could just ignore oneway thrift calls when not the primary instead of throwing an exception. |
|
It looks like the information that's needed is in the Processor class, specifically a subclass that has the same name as the method, then you call For example, for the |
Restored highly available service in order to support multiple managers. This change should also avoid clients that connect to a non-primary manager from getting stuck. Made all thrift methods throw ThriftNotActiveServiceException, if the method does not declare this it will cause a server side exception. There was a single one way thrift method that could not throw this, so it would still cause server side exceptions if its called. Delaying advertising should prevent this in most cases.