Skip to content

restores highly available service#6153

Open
keith-turner wants to merge 1 commit intoapache:mainfrom
keith-turner:highly-avail-service
Open

restores highly available service#6153
keith-turner wants to merge 1 commit intoapache:mainfrom
keith-turner:highly-avail-service

Conversation

@keith-turner
Copy link
Contributor

Restored highly available service in order to support multiple managers. This change should also avoid clients that connect to a non-primary manager from getting stuck. Made all thrift methods throw ThriftNotActiveServiceException, if the method does not declare this it will cause a server side exception. There was a single one way thrift method that could not throw this, so it would still cause server side exceptions if its called. Delaying advertising should prevent this in most cases.

Restored highly available service in order to support multiple managers.
This change should also avoid clients that connect to a non-primary
manager from getting stuck.  Made all thrift methods throw
ThriftNotActiveServiceException, if the method does not declare this it
will cause a server side exception.  There was a single one way thrift
method that could not throw this, so it would still cause server side
exceptions if its called.  Delaying advertising should prevent this in
most cases.
@keith-turner keith-turner added this to the 4.0.0 milestone Feb 25, 2026
@keith-turner
Copy link
Contributor Author

Now that the compaction coordinator can throw ThriftNotActiveServiceException the compactors should retry if they see this exception. Looked at the compactors and it seems like they will retry as all of their calls use RetryableThriftCall.

* lock.
*
* <p>
* Its expected that all methods in the wrapped thrift service declare they throw
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these Highly* classes were copied as is from an earlier commit. The only change I made was adding this comment.

compactionCoordinator.getThriftService(), managerClientHandler, getContext());
wrappedCoordinator, managerClientHandler, getContext());
try {
updateThriftServer(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Manager is the only process that passes false for the last argument so that it can delay starting the ThriftServer until after the Manager is fully up. I think with these changes that last parameter can be removed from the method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into removing that.

}, false);
// Now that the Manager is up, start the ThriftServer
Objects.requireNonNull(getThriftServerAddress(), "Thrift Server Address should not be null");
getThriftServerAddress().startThriftServer("Manager Client Service Handler");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the suggestion above is implemented, then this line can be removed.

@dlmarion
Copy link
Contributor

I wonder if there is a way to automate a check that all Thrift RPC methods handled by the Manager throws the ThriftNotActiveServiceException

@keith-turner
Copy link
Contributor Author

I wonder if there is a way to automate a check that all Thrift RPC methods handled by the Manager throws the ThriftNotActiveServiceException

Not sure this functionality has any tests. There is the server side aspect of throwing the exception, if we could automate test for that it would be nice. Then there is the client side code that retries when it sees this exception, that code is sprinkled all over the place. The server side code seems easier to test. For the client side code, maybe the best way to test is to run random walk tests while starting/stopping managers.

@keith-turner
Copy link
Contributor Author

keith-turner commented Feb 25, 2026

For the client side code, maybe if we refactored to use more common code on the client side then we could focus on testing that common code. I noticed the compactor code did not need to change because it uses common code for retry. That common code retries on TException though which may be too broad for the general case, like we probably should not retry on a thrift security exception.

@dlmarion
Copy link
Contributor

I looked at trying to write a test for this, but could not figure it out quickly. Another option would be to fail at runtime on the server side, which should be caught during testing. To do this, we could include the following method in the ThriftProcessorTypes class, then call it for each interface passed to the getManagerTProcessor method.

  private void validateHAServerExceptions(Class<?> thriftInterface) {
    String className = thriftInterface.getClass().getName();
    Method[] methods = thriftInterface.getClass().getMethods();
    for (Method m : methods) {
      Class<?>[] exceptionClasses = m.getExceptionTypes();
      if (exceptionClasses.length == 0) {
        throw new IllegalStateException("Method " + m.getName() + " on " + className + ""
            + " does not declare ThriftNotActiveServiceException to be thrown");
      }
      boolean found = false;
      for (Class<?> ec : exceptionClasses) {
        if (ThriftNotActiveServiceException.class.getName().equals(ec.getClass().getName())) {
          found = true;
          break;
        }
      }
      if (!found) {
        throw new IllegalStateException("Method " + m.getName() + " on " + className + ""
            + " does not declare ThriftNotActiveServiceException to be thrown");        
      }
    }
  }

@keith-turner
Copy link
Contributor Author

Re using reflection to look for absence of the exception, thrift one way methods can not throw an exception. Not sure if we can detect if a method is one way. If we could that would be nice in the HAService code, it could just ignore oneway thrift calls when not the primary instead of throwing an exception.

@dlmarion
Copy link
Contributor

It looks like the information that's needed is in the Processor class, specifically a subclass that has the same name as the method, then you call isOneWay on it.

For example, for the ManagerClientService.Iface.waitForFlush method, there is a ManagerClientService.Processor.waitForFlush class, and its isOneWay method returns false.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants